Introduction to the Special Section on Computer-Based Assessment of
Cross-Curricular Skills and Processes 


Samuel Greiff and Romain Martin 
University of Luxembourg 


Birgit Spinath 
Heidelberg University 


Keywords: computer-based assessment, cross-curricular skills, behavioral processes, domain-general 


This special section presents a collection of articles that were 
submitted to Journal of Educational Psychology in response to a 
call for papers on computer-based assessment of cross-curricular 
skills and processes. The development of innovative computer-based
assessment instruments that target cross-curricular skills and
processes and the validation of these instruments within educational
psychology have been a field of ongoing scientific inquiry
with substantial research activity in recent years. After a selective 
and stringent peer-review process, this special section includes six 
articles that report cutting-edge research and present a cross- 
section of different topics. 

Why is a special section on computer-based assessment of 
cross-curricular skills and processes both timely and important to 
researchers interested in the field of educational psychology? 
There are a number of good reasons, but a major reason is that the 
cognitive and interpersonal skills necessary for successful partic- 
ipation in society have undergone great changes in recent decades. 
Studies have shown that tasks at school, university, and work have 
become more demanding and less bound to single-subject matters 
or domains (e.g., Autor, Levy, & Murnane, 2003; Spitz-Oener, 
2006). Tasks now more often involve cross-curricular, nonroutine, 
and complex skills and processes (e.g., problem solving) that are 
applicable in diverse situations and content areas (Greiff et al., 
2013; Hautamäki et al., 2002). Mayer and Wittrock (2006) highlighted
the importance of problem solving as one prime example of
a cross-curricular skill for educational psychologists. In fact, they 
proposed that helping students become better problem solvers is 
one of the greatest challenges in educational psychology. 

Consequently, the development of assessment instruments to 
measure cross-curricular skills and processes as well as their 
validation has been an ongoing field of inquiry in psychometrics 
and educational psychology alike. However, the dynamic and 
interactive nature of these skills implies that their assessment may 
not lie within reach of classical paper-and-pencil instruments. 
Fortunately, the advent of computers in virtually any setting of
educational assessment has allowed for the emergence of innova- 
tive assessment procedures. In addition to offering increased flex- 
ibility, computer-administered tests record log-file data during task 
execution, thus providing further insight into behavioral processes 
that are not captured by final performance data. For instance, time 
on task may be used to better understand how students become 
involved in a proposed task or to yield information about the type 
and quality of cognitive processing that occurs while students 
work on an educational assessment. 

From an applied perspective, computer-based assessment envi- 
ronments and the use of computer-generated log-file data are now 
found in a large range of educational settings, including interna- 
tional large-scale assessments such as the Programme for Interna- 
tional Student Assessment (PISA) and the Programme for the 
International Assessment of Adult Competencies (PIAAC). Cross- 
curricular skills have become integral parts of the assessment 
framework in interactive problem solving (Organisation for Eco- 
nomic Co-operation and Development [OECD], 2010), collabor- 
ative problem solving (OECD, 2012), problem solving in 
technology-rich environments (OECD, 2009), and electronic read- 
ing assessment (OECD, 2011). In addition, process data are now 
implemented in the scoring procedures of large-scale educational 
assessments, for example, to correct for obvious guessing as iden- 
tified through a lack of the required exploration behavior or to 
integrate behavioral data as potential performance indicators that 
go beyond merely scoring the number of correct answers. 

As a consequence, research concerning the setup and use of 
computer-based assessment instruments in educational contexts is 
quickly emerging and has great relevance to researchers and prac- 
titioners alike. This special section in Journal of Educational 
Psychology pays tribute to the general need for rigorous empirical 
research in this field. This need is illustrated through the assess- 
ment of cross-curricular skills in particular, stressing the impor- 
tance of developing a theoretical understanding of these skills and 
the added value of computerized assessments gained through the 
setup of interactive and complex assessment environments and the 
use of log-file data. This special section is composed of articles 
that report on the development of theoretically sound and scien- 
tifically validated assessment instruments for cross-curricular 
skills and on the benefits of methodological advances associated 
with computer-based assessment, such as the benefits of log-file 
analyses for assessing classical and cross-curricular cognitive abil- 
ities. Articles are related to assessment issues in education, and 
some of them exhibit strong ties to international or national
large-scale efforts. For example, some contributions uncover behavioral
interaction patterns that are not reflected in final performance data 
and relate these patterns to psychological theories and relevant 
educational outcomes within a large-scale assessment. Thus, the 
common denominator of all articles published in the special sec- 
tion is that they exploit the computer in an innovative manner and 
substantially widen the scope of our view on students’ skills. Much 
of the research in this special section is embedded in the context of 
large-scale educational assessments, but some of it is experimental 
or draws on selective subgroups of student populations. 

In the first contribution (Goldhammer, Naumann, Stelter, Tóth,
Rölke, & Klieme, 2014), the authors provided insights into behavioral
processes that, until recently, were exclusively addressed in
experimental settings. They elaborated on differential effects of the 
meaning of time on task in problem solving and in reading based 
on a representative German sample from the PIAAC field trial 
data. In a related way, the second contribution (Kupiainen, 
Vainikainen, Marjanen, & Hautamäki, 2014) investigated the role
of time on task as an indicator of students’ investment and the 
subsequent effect on students’ achievement. Both articles make 
use of log files and show how this approach not only broadens the 
understanding of assessment but also advances theory in the field 
of educational psychology and may ultimately lead to the devel- 
opment of interventions that can be used in the classroom. 

The third contribution (Csapó, Molnár, & Nagy, 2014) demonstrates
that psychometric properties can be optimized through
computer-based test delivery even in an assessment of school 
readiness at a very young age. This illustrates that computers as 
assessment instruments can be used across a variety of age 
groups—a topic with very limited information until now. The 
fourth contribution (Ifenthaler, 2014) shows how computers can be 
used not only to collect large amounts of data but also to process 
and to automatically score the data in the context of team-based 
processes and performance. By doing so, Ifenthaler investigated 
team effectiveness, an area that is very relevant to the assessment 
of cooperation and collaboration in large-scale educational assess- 
ments. Collaborative problem solving will be assessed in the PISA 
2015 cycle. 

The fifth contribution (Greiff, Kretzschmar, Müller, Spinath, &
Martin, 2014) addressed complex problem solving, a phenomenon 
particularly relevant to cross-curricular skills. The authors related 
complex problem solving to intelligence and computer skills and 
showed in three different samples that the added value of complex 
problem solving cannot be traced back to an indirect assessment of 
computer skills. In fact, the added value seems to originate from 
complex cognitive processes associated with computer-simulated 
problem solving tasks. Further investigating the skill of complex 
problem solving, the sixth contribution (Sonnleitner, Brunner, Keller, 
& Martin, 2014) reported that computer-based simulations of com- 
plex cognitive processes may be less influenced than paper-and-pencil 
tests of intelligence by students’ cultural backgrounds, thus yielding a 
less biased and fairer assessment of cognitive skills for disadvantaged 
groups or minorities. These two articles provide strong empirical 
support for the claim that computer-based assessment allows psychol- 
ogists to widen their scope to new cognitive constructs that cannot be 
accessed via classical paper-and-pencil-based measures. 

These six contributions in the special section span a wide array 
of different topics. They do not cover all topics relevant in the 
field, but they do cover important topics for scientists interested in
new developments in educational psychology. All articles in this
special section were subjected to the normal rigorous process of 
anonymous peer review. Articles in which one of the guest editors 
was involved as a contributing author were not reviewed or edited 
by any of the other guest editors. Editorship and authorship were 
strictly separated in accordance with American Psychological As- 
sociation guidelines. 

Over 20 years ago, Bunderson, Inouye, and Olsen (1989) pre- 
dicted that new generations of computer-based assessment instru- 
ments would swiftly evolve along with a rapid decline in paper- 
and-pencil testing. From today’s perspective, the shift toward a 
new generation of tests that allow for the assessment of more 
general and transversal skills and that exploit process data as a 
standard procedure has been much slower than was anticipated in 
the late 1980s. Considering how long computers have been avail- 
able, Williamson, Bejar, and Mislevy (2006) observed that the 
exploitation of the potential that lay in computer-based assessment 
has been slower than expected. However, assessment is now at a 
transition point where many former barriers, such as availability of 
computer equipment in the classroom or the general level of 
computer literacy (cf. digital natives; Prensky, 2001), have become 
less relevant. As a consequence, computer-based assessment is 
widely available and accepted today, even if its potential for added 
value still needs to be established and fostered. The contributions 
in this special section are committed to advancing the knowledge 
of computer-based assessment in educational contexts. In doing so, 
our goal for this special section is to enhance the understanding of
the assessment of cross-curricular skills and processes at different 
educational levels and to explain the process of skill acquisition by 
developing and testing adequate models. We sincerely hope that 
you enjoy reading this special section in the Journal of Educa- 
tional Psychology. 


Computer-based assessment can provide new insights into behavioral processes of task completion that
cannot be uncovered by paper-based instruments. Time presents a major characteristic of the task 
completion process. Psychologically, time on task has 2 different interpretations, suggesting opposing 
associations with task outcome: Spending more time may be positively related to the outcome as the task 
is completed more carefully. However, the relation may be negative if working more fluently, and thus 
faster, reflects higher skill level. Using a dual processing theory framework, the present study argues that 
the validity of each assumption is dependent on the relative degree of controlled versus routine cognitive 
processing required by a task, as well as a person’s acquired skill. A total of 1,020 persons ages 16 to 
65 years participated in the German field test of the Programme for the International Assessment of Adult 
Competencies. Test takers completed computer-based reading and problem solving tasks. As revealed by 
linear mixed models, in problem solving, which required controlled processing, the time on task effect 
was positive and increased with task difficulty. In reading tasks, which required more routine processing, 
the time on task effect was negative and the more negative, the easier a task was. In problem solving, the 
positive time on task effect decreased with increasing skill level. In reading, the negative time on task 
effect increased with increasing skill level. These heterogeneous effects suggest that time on task has no 
uniform interpretation but is a function of task difficulty and individual skill. 


Keywords: computer-based assessment, time on task, automatic and controlled processing, reading 

There are two fundamental observations on human performance:
the result obtained on a task and the time taken (e.g., Ebel,
1953). In educational assessment, the focus is mainly on the task 
outcome; behavioral processes that led to the result are usually not 
considered. One reason may be that traditional assessments are 
paper-based and, hence, are not suitable for collecting behavioral 
process data at the task level (cf. Scheuermann & Björnsson,
2009). However, computer-based assessment, besides other advantages
such as increased construct validity (e.g., Sireci & Zenisky, 2006)
or improved test design (e.g., van der Linden, 2005), can provide
further insights into the task completion process. This is because
in computer-based assessment, the assessment system can record log file
data that allow the researcher to derive theoretically meaningful
descriptors of the task
completion process. The present study draws on log file data from 
an international computer-based large-scale assessment to address 
the question of how time on task is related to the task outcome. As 
shown in the following, by analyzing the relation of task perfor- 
mance to the time test takers spent on task, we were able to obtain 
new insights into how the interaction of task and person characteristics
determines the type of cognitive processing. Such insights can, for
instance, contribute to the validation of the assessment if time on
task can be related to the task response in a theoretically sound
way.
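
To illustrate how a descriptor such as time on task might be derived from log file data, the following minimal Python sketch assumes that time on task is the time from task onset to task completion. The log layout and the event names (task_start, page_enter, task_end) are hypothetical and chosen only for illustration; they do not describe the log format of PIAAC or of any other assessment system.

```python
from collections import defaultdict

# Hypothetical log records: (person_id, task_id, event, timestamp in seconds).
# Event names are assumptions made for this sketch, not a real log format.
LOG = [
    ("p1", "t1", "task_start", 0.0),
    ("p1", "t1", "page_enter", 12.5),
    ("p1", "t1", "task_end", 40.0),
    ("p2", "t1", "task_start", 3.0),
    ("p2", "t1", "task_end", 21.0),
]


def time_on_task(log):
    """Return {(person, task): seconds from task onset to task completion}."""
    start, end = {}, {}
    for person, task, event, ts in log:
        key = (person, task)
        if event == "task_start":
            start[key] = ts
        elif event == "task_end":
            end[key] = ts
    # Only person-task pairs with both an onset and a completion are kept.
    return {key: end[key] - start[key] for key in start if key in end}


def time_per_segment(log):
    """Split time on task into the segments between consecutive logged events."""
    events = defaultdict(list)
    for person, task, event, ts in log:
        events[(person, task)].append((ts, event))
    segments = {}
    for key, evs in events.items():
        evs.sort()  # chronological order
        # Each segment is labeled by the event that opens it.
        segments[key] = [(e1, t2 - t1) for (t1, e1), (t2, _) in zip(evs, evs[1:])]
    return segments


if __name__ == "__main__":
    print(time_on_task(LOG))
    print(time_per_segment(LOG))
```

The second function shows one possible way to break total time on task into components, for example the time spent before and after a page change.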

Time on task is an important characteristic of the solution 
process indicating the duration of perceptual, cognitive, and
psychomotor activities. From a measurement point of view, the
usefulness of time on task and the task outcome, respectively,
depends on the task’s difficulty. In easy tasks assessing basic skills,
individual differences will mainly occur in response latencies, 
whereas accuracy will be consistently high. Following this logic, a 
number of assessment tools that address constructs like naming 
speed (e.g., Nicolson & Fawcett, 1994), visual word recognition 
(e.g., Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004), or 
number naming speed (e.g., Krajewski & Schneider, 2009) make 
use of time on task. In contrast, in more difficult tasks the accuracy 
of a result is of interest, for example, in assessments of reading 
comprehension (e.g., van den Broek & Espin, 2012) or problem
solving (e.g., Greiff, Wüstenberg, et al., 2013; Klieme, 2004;
Mayer, 1994; Wirth & Klieme, 2003). In these skill assessments, 
time on task usually is not taken into account. Nevertheless, both 
the task result and time on task constitute task performance re- 
gardless of the task’s difficulty. 

In skill assessments, the relation between time on task and task 
result (accuracy) can be conceived of in two ways. On the one 
hand, taking more time to work on a task may be positively related 
to the result as the task is completed more thoroughly. On the other 
hand, the relation may be negative if working faster and more 
fluently reflects a higher skill level. The present study addresses 
these contradictory predictions and aims at clarifying the condi- 
tions of their validity by jointly analyzing task success and time on 
task data from the computer-based Programme for the Interna- 
tional Assessment of Adult Competencies (PIAAC; cf. OECD, 
2013; Schleicher, 2008). Thus, we take advantage of the fact that 
computer-based assessment renders data available on a large scale 
that was previously available only through small-scale experiment- 
ing (i.e., time on task). Data such as time spent on individual tasks 
can serve to answer basic research questions (such as clarifying the 
relation of time on task and task result in different domains). 
Furthermore, the data can enhance educational assessment. For 
instance, construct validation can be supported by testing whether 
behavioral process indicators are related to task outcomes as 
expected from theory. 


Time on Task 


Time on task is understood as the time from task onset to task 
completion. Thus, if the task was completed in order, it reflects the 
time taken to become familiar with the task, to process the mate- 
rials provided to solve the task, to think about the solution, and to 
give a response.¹ In tasks requiring the participant to interact with
the stimulus through multiple steps, time on task can be further 
split into components, for instance, reflecting the time taken to 
process a single page from a multipage stimulus. To model time on 
task, two different approaches have been suggested (cf. van der 
Linden, 2007, 2009). First, time is considered an indicator of a 
(latent) construct, for example, reading speed (Carver, 1992) or 
reasoning speed (Goldhammer & Klein Entink, 2011). Here, response
and time data are modeled using separate measurement
models. Second, within an explanatory item response model, time
is used as a predictor to explain differences in task success (cf. 
Roskam, 1997). In the present study, this second approach is used 
to investigate the relation between time on task and task success. 
Task success (dependent variable) can be perceived as a function 
of time on task (independent variable) because the individual is 
able to control time spent on completing a task to some extent, 
which in turn may affect the probability of attaining the correct 
result (cf. van der Linden, 2009). 
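
As a rough sketch of this second approach, an explanatory item response model with time on task as a predictor could be written as follows. The notation is an assumption made for illustration and is not taken from the study: X_pi is the scored response (1 = correct) of person p on task i, t_pi is the time on task (possibly log-transformed), and b_p and b_i are random person and task intercepts.

```latex
\operatorname{logit} P(X_{pi} = 1) = \beta_0 + \beta_1 t_{pi} + b_p + b_i,
\qquad b_p \sim N(0, \sigma_p^2), \quad b_i \sim N(0, \sigma_i^2)
```

In this form, a positive \beta_1 would indicate that more time on task goes along with a higher probability of task success, and a negative \beta_1 would indicate the opposite.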


Relation of Time on Task to Task Success 


When investigating the relation between time on task and task 
success, the well-known speed–accuracy tradeoff, which is usually
investigated in experimental research (cf. Luce, 1986), has to be
taken into account. Tradeoff means that for a given person working
on a particular task, accuracy will decrease as the person works
faster. The positive relation between time on task and task success,
as predicted by the speed–accuracy tradeoff, is a within-person
phenomenon that can be expected for any task (e.g., Wickelgren,
1977). However, when switching from the within-person level to the
population level, the relation between time on task and task success
might be completely different (e.g., negative or absent),
although within each person the speed–accuracy compromise
still implies a positive relation between time on task and
task success (cf. van der Linden, 2007). Consequently, at the
population level, findings on the relation of time on task with task
success may be heterogeneous. One line of research modeling time
on task as an indicator of speed provides speed–skill or speed–ability
correlations of different directions and strengths across
domains. For example, for reasoning, positive correlations be- 
tween skill (measured through task success) and slowness (mea- 
sured through time on task) were found (e.g., Goldhammer & 
Klein Entink, 2011; Klein Entink, Fox, & van der Linden, 2009). 
For arithmetic, zero correlations (van der Linden, Scrams, &
Schnipke, 1999) were obtained, whereas for basic skills to operate
a computer’s graphical user interface, a negative relation was
demonstrated (Goldhammer, Naumann, & Keßel, 2013), as was the case
for basic reading tasks such as phonological comparison and
lexical decision (Richter, Isberner, Naumann, & Kutzner, 2012).
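
The difference between the within-person and the population level can be made concrete with a small numerical toy example. The Python sketch below is not a model from the literature cited above; the logistic form, the reference time of 30 seconds, and the assumption that more skilled persons habitually work faster are all made up for illustration. Within a person, more time always raises the success probability, yet across persons observed at their typical pace, time and success are negatively correlated.

```python
import math


def p_correct(skill, seconds, slope=1.0):
    """Toy success probability: rises with skill and, for a fixed person,
    with the time spent (within-person speed-accuracy tradeoff)."""
    return 1.0 / (1.0 + math.exp(-(skill + slope * math.log(seconds / 30.0))))


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Five hypothetical persons with different skill levels.
skills = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Assumption: more skilled persons habitually work faster.
typical_time = {s: 30.0 * math.exp(-0.3 * s) for s in skills}

# Within-person level: the same person (skill 0) given 15, 30, or 60 seconds.
within = [p_correct(0.0, t) for t in (15.0, 30.0, 60.0)]

# Population level: each person observed once, at his or her typical time.
times = [typical_time[s] for s in skills]
probs = [p_correct(s, typical_time[s]) for s in skills]
r_population = pearson(times, probs)

if __name__ == "__main__":
    print("within-person probabilities:", within)
    print("population-level correlation:", r_population)
```

Changing the assumed link between skill and typical time (here the factor 0.3) would change the sign and size of the population-level correlation, which is one way to see why findings across domains can be heterogeneous.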

These results suggest that the time on task effect might be 
moderated by domain and task difficulty. A comparison of tasks 
across studies reveals that in difficult tasks assessing, for instance,
reasoning, task success is positively related to time on task, 
whereas in easy tasks, such as basic interactions with a computer 
interface, the relation is negative. Independent evidence for this 
line of reasoning comes from research suggesting that task diffi- 
culty within a given domain affects the association between time 
on task and task success. Neubauer (1990) investigated the corre- 
lation between the average time on task and the test score for 
figural reasoning tasks and found a zero correlation. However, for 
task clusters of low, medium, and high difficulty, he found negative,
zero, and positive correlations, respectively. Similarly, in a
recent study by Dodonova and Dodonov (2013), the strength of the
negative correlation between time on task and accuracy in a letter
sequence task tended to decrease with increasing task difficulty.

¹ Depending on what is considered to be a task, there may be alternative
definitions of time on task. For instance, in this special section, Kupiainen,
Vainikainen, Marjanen, and Hautamäki (2014) use the term time on task to
refer to the time needed to complete a test in a learning to learn assessment,
whereas response time is considered to represent the time needed to
respond to a single question or problem (which is comparable to our notion
of time on task).


Time on Task Effects and Dual Processing Theory 


An explanation for the heterogeneity of associations between 
time on task and task success may be provided by dual processing 
theory, which distinguishes between automatic and controlled 
mental processes (cf. Fitts & Posner, 1967; Schneider & Chein, 
2003; Schneider & Shiffrin, 1977). Automatic processes are fast, 
proceduralized, and parallel; they require little effort and operate 
without active control or attention, whereas controlled processes 
are slow, are serial, require attentional control, and can be alternated
quickly. Through learning, tasks become amenable to automatic processing
only under consistent conditions, that is, when the rules for information
processing, including related information-processing components
and their sequence, are invariant (Ackerman, 1987). Learning
under consistent conditions can be divided into three stages (cf.
Ackerman & Cianciolo, 2000; Fitts & Posner, 1967). The first 
stage, when the individual acquires task knowledge and creates 
a production system (cf. Adaptive Control of Thought [ACT] 
theory; Anderson & Lebiere, 1998), is characterized by con- 
trolled processing. Automatic processing becomes more appar- 
ent in the second stage and dominates in the third stage. Thus, 
task performance is slow and error prone at the beginning of 
learning, but speed and accuracy increase as the strength of 
productions is increased through practice (Anderson, 1992). 

Consequently, in domains and tasks that allow for automatic 
processing, a negative association between time on task and 
task success is expected. Well-practiced task completion is 
associated with both fast and correct responses. In contrast, a 
positive association is expected in domains and tasks that do not 
allow for a transition from controlled to automatic processing 
due to inconsistent processing rules and variable sequences of 
information processing. Taking more time to work carefully 
would positively impact task success. In line with this reason- 
ing, Klein Entink et al. (2009) showed that test effort in a 
reasoning test, that is, the extent to which a test taker cares 
about the result, is positively related to test-taking slowness 
(measured through time on task), which itself is positively 
related to skill (measured through task success). 

Notably, dual processing theory suggests a dynamic interaction 
of automatic and controlled processing in that the acquisition of 
higher level cognition is enabled by and builds upon automatic 
subsystems (Shiffrin & Schneider, 1977). Basically, tasks within 
and between domains are assumed to differ with respect to the 
composition of demands that necessarily require controlled 
processing and those that can pass into automatic processing 
(Schneider & Fisk, 1983). Similarly, for a particular task, individuals
are assumed to differ in the extent to which the task-specific
information-processing elements that can be automatized
are actually automatized (e.g., Carlson, Sullivan, & Schneider, 
1989). In the following two sections, we describe in detail how 
automatic and controlled processes may interact in the two do- 
mains considered, reading and problem solving. 


Time on Task in Reading 


Reading a text demands a number of cognitive component 
processes and related skills. Readers have to identify letters and 
words. Syntactic roles are then assigned to words, sentences are 
parsed for their syntax, and their meaning is extracted. Coherence 
must be established between sentences, and a representation of the 
propositional text base must be created, as well as a situation 
model of the text contents, integrated with prior knowledge 
(Kintsch, 1998). In addition, cognitive and metacognitive regula- 
tions might be employed. When text contents are learned, strate- 
gies of organization and elaboration will aid the learning process. 

These different cognitive component skills allow for a transition 
from controlled to automatic processing to different degrees. Pro- 
cesses such as phonological recoding, orthographic comparison, or 
the retrieval of word meanings from long-term memory are slow 
and error prone in younger readers but become faster and more 
accurate as reading skill acquisition progresses (Richter, Isberner, 
Naumann, & Neeb, 2013). Indeed, theories of reading such as the 
lexical quality hypothesis (Perfetti, 2007) claim that reading skill 
rests on reliable as well as quickly retrievable lexical representa- 
tions. In line with this, text comprehension is predicted by the 
speed of access to phonological, orthographic, and meaning rep- 
resentations (e.g., Richter et al., 2012, 2013). Beyond the word 
level, the speed of semantic integration and local coherence processes
is likewise positively related to comprehension (e.g., Naumann,
Richter, Christmann, & Groeben, 2008; Naumann, Richter,
Flender, Christmann, & Groeben, 2007; Richter et al., 2012). As 
shown by longitudinal studies, accuracy in reading assessments 
during primary school approaches perfection, whereas reading 
fluency reflecting reading performance per time unit continues to 
increase across years of schooling (cf. Landerl & Wimmer, 2008). 
The high accuracy rates suggest that reading is already well 
automatized during primary school. 

Following this line of reasoning, in reading tasks, a negative 
time on task effect might be expected. A number of reading tasks, 
however, require attentional cognitive processing to a substantial 
degree as well. For instance, readers might need to actively choose 
which parts of a text to attend to when pursuing a given reading 
goal (e.g., Gräsel, Fischer, & Mandl, 2000; Naumann et al., 2007,
2008; Organisation for Economic Co-Operation and Development 
[OECD], 2011, chap. 3; Puntambekar & Stylianou, 2005). In the 
case of a difficult text, strategies such as rereading or engaging in 
self-explanations (e.g., Best, Rowe, Ozuru, & McNamara, 2005; 
McKeown, Beck, & Blake, 2009) are needed for comprehension. 
Even in skilled readers, such processes require cognitive effort
(Walczyk, 2000), and effort invested in strategic reading positively
predicts comprehension (e.g., Richter, Naumann, Brunner, &
Christmann, 2005; Sullivan, Gnesdilow, & Puntambekar, 2011).
This, however, entails more time spent on task.

Taken together, this means that in easy reading tasks, the po- 
tentially automatic nature of reading processes at the word, sen- 
tence, and local coherence level leads to a negative time on task 
effect (e.g., when reading a short and highly coherent linear text). 
As reading tasks become more difficult and readers need to engage 
in strategic and thus controlled cognitive processing, the negative 
time on task effect will be diminished or reversed. 


Time on Task in Problem Solving


Problem solving is required in situations where a person cannot 
attain a goal by using routine actions or thinking due to barriers or 
novelty (e.g., Funke & Frensch, 2007; Mayer, 1992; Wirth & 
Klieme, 2003). Problem solving requires higher order thinking, the 
finding of new solutions, and sometimes interaction with a dy- 
namic environment (Klieme, 2004; Mayer, 1994). In the present 
study, a specific concept of problem solving as defined for the 
PIAAC study is taken into account; it refers to solving information 
problems in technology-rich environments. That is, technology- 
based tools and information sources (e.g., search engines, Web 
pages) are used to solve a given problem by “storing, processing, 
representing, and communicating symbolic information” (OECD, 
2009b, p. 8). Information problems in this sense (e.g., finding 
information on the Web fulfilling multiple criteria to take a deci- 
sion) cannot be solved immediately and routinely. They require 
developing a plan consisting of a set of properly arranged subgoals 
and performing corresponding actions through which the goal state 
can be reached (e.g., identifying the need for information to be 
obtained from the Web, defining an appropriate Web search query, 
scanning the search engine results page, checking linked Web 
pages for multiple criteria, collecting and comparing information 
from selected Web pages, and making use of it in the decision to 
be taken). This differs, for instance, from solving logical or math- 
ematical problems where complexity is determined by reasoning 
requirements but not primarily by the information that needs to be 
accessed and used (OECD, 2009b). Cognitive and metacognitive 
aspects of problem solving as assessed in PIAAC include setting 
up appropriate goals and plans to achieve the goal state. This 
includes monitoring the progress of goal attainment, accessing and 
evaluating multiple sources of information, and making use of this 
information (OECD, 2009b, p. 11). 

Problem solving is a prototype of an activity that relies on 
controlled processing. Controlled processing enables an individual 
to deal with novel situations for which automatic procedures and 
productions have not yet been learned. Otherwise, the situation 
would not constitute a problem. Accordingly, Schneider and Fisk 
(1983) described skilled behavior in problem solving and strategy 
planning as a function of controlled processing. Notably, problem 
solving skill may also benefit from practice. The development of 
fluent component skills at the level of subgoals enables problem 
solvers to improve their strategies optimizing the problem solving 
process (see, e.g., Carlson, Khoo, Yaure, & Schneider, 1990). 

General conceptualizations of (complex) problem solving con- 
ceive problem solving performance as consisting of knowledge 
acquisition including problem representation and the application of 
this knowledge to generate solutions (cf. Funke, 2001; Greiff, 
Wüstenberg, et al., 2013). Wirth and Leutner (2008) identified two 
simultaneous goals in the knowledge acquisition phase, that is, 
generating information through inductive search and integrating 
this information into a coherent model. Successful problem solvers 
move more quickly from identification to integration and thus will 
be able to invest time in advanced modeling and prediction (which 
provide the basis for successful knowledge application) rather than 
in low-level information processing. 

Problem solving in technology-rich environments assumes two 
concepts, accessing information and making use of it, that seem 
similar to knowledge acquisition and application. However, there 
are differences in that, for instance, retrieving information (e.g., by 
means of a search engine) is not comparable to an inductive search 
for rules governing an unknown complex system. Nevertheless, 
the various notions of problem solving assume successive steps of 
controlled information processing that may benefit from fluent 
component skills. 

Therefore, a positive effect of time on task on task success is 
expected for problem solving. Taking sufficient time allows for all 
serial steps to planned subgoals to be processed, as well as more 
sophisticated operations to be used and properly monitored regarding 
progress. Particularly for weak problem solvers, spending 
more time on a task may be helpful to compensate for a lack of 
automaticity in required subsystems (e.g., reading or computer 
handling processes). 


Research Goal and Hypotheses 


Our general research goal was to assess and investigate behav- 
ioral processes and their relation to task performance in computer- 
based assessment. More specifically, we determined the effect of 
time on task on the task result and the conditions that influence 
the strength and direction of this effect. For this, we used the 
computer-based assessment of reading and problem solving in the 
international large-scale study PIAAC, including log file data 
generated by the assessment system. 

From a dual processing framework, we derived the general 
hypothesis that the relative degree of controlled versus automatic 
cognitive processing as required by a task, as well as the test 
taker’s acquired skill level, determines the strength and direction 
of the time on task effect. The following three hypotheses address 
time on task effects across domains, task properties, and person 
characteristics. The fourth hypothesis aims at validating the inter- 
pretation of the time on task effect in problem solving by splitting 
up the global time on task into components that represent different 
steps of task solution and information processing. 


Hypothesis 1: Time on task effect across domains. We expected 
a positive time on task effect for problem solving in 
technology-rich environment tasks. A negative time on task 
effect was expected for reading tasks because, in reading 
tasks, a number of component cognitive processes are apt for 
automatization. Problem solving, in contrast, by definition 
must rely on controlled processing to a substantial degree in 
each task. 


Hypothesis 2: Time on task effect across tasks. Within do- 
mains, we expected the time on task effect to be moderated by 
task difficulty. Easy tasks can be assumed to be completed 
substantially by means of automatic processing, whereas dif- 
ficult tasks evoking more errors require a higher level of 
controlled processing. Accordingly, we expected a positive 
time on task effect in problem solving to become stronger with 
increasing task difficulty, and a negative time on task effect in 
reading to diminish with increasing task difficulty. 

As our interpretation of the time on task effect focuses on the 
mode of cognitive processing, we additionally explored the 
potentially moderating role of the cognitive operation in- 
volved in each task as defined a priori by the PIAAC assess- 
ment framework (e.g., access in reading). More specifically, 
we investigated whether the task characteristic “cognitive 
operation” explains task difficulty and if so whether the time 
on task effect would depend on the presence of specific 
cognitive operations. 


Hypothesis 3: Time on task effect across persons. For a 
given task, individuals are assumed to differ in the extent to 
which the information-processing elements that are amena- 
ble to automatic processing are actually automatized. 
Highly skilled individuals are expected to be in command 
of well-automatized procedures within task solution sub- 
systems that are apt to automatization (such as decoding in 
reading or using shortcuts to perform basic operations in a 
computer environment). We therefore expect the time on 
task effect to vary across persons. On the one hand, we 
predict that the time on task effect gets more positive for 
less skilled problem solvers and less negative for less 
skilled readers since they are expected to accomplish tasks 
with higher demands of controlled and strategic processing 
than skilled persons. For example, poor readers may rely on 
compensatory behaviors and strategies, especially when 
completing difficult tasks (see Walczyk, 2000). On the 
other hand, for skilled persons, we expect the inverse result, 
that is, due to a higher degree of routinized processing, the 
time on task effect gets less positive for skilled problem 
solvers and more negative for skilled readers. 


Hypothesis 4: Decomposing time on task effect at task level. 
Computer-based assessment and especially the exploitation of 
log file data can help to further understand the task completion 
process. By moving from the global process measure of time 
on task to the underlying constituents, we can further validate 
the interpretation of the time on task effect. This is especially 
true for tasks requiring a complex sequence of stimulus inter- 
actions that can be reconstructed from a log file, giving insight 
into the accuracy and timing with which subgoals were being 
completed. In the present study, tasks assessing problem 
solving in technology-rich environments are highly inter- 
active, requiring the operation of simulated computer and 
software environments or navigation in simulated Web envi- 
ronments. For a particular task, we expect that a positive time 
on task effect is confined to the completion of steps that are 
crucial for a correct solution (e.g., in a Web environment, 
visiting a page that presents information needed to give a 
correct response), whereas for others the effect is assumed to 
be negative (e.g., in a Web environment, visiting an irrelevant 
page). If this were the case, it would corroborate our assump- 
tion that it is the need for strategic and controlled allocation of 
cognitive resources that produces a positive time on task effect 
in problem solving or very difficult reading tasks. 


Method 


Sample 


The PIAAC study initiated internationally by the OECD (cf. 
OECD, 2013; Schleicher, 2008) is a fully computer-based inter- 
national comparative study assessing the competence levels of 
adults in 2011-2012. For the present study, data provided by 
GESIS – Leibniz Institute for the Social Sciences from the German 
PIAAC field test in 2010 were used. The target population consisted 
of all noninstitutionalized adults between the ages of 16 and 
65 years (inclusive) who resided in Germany at the time of sample 
selection and were enrolled in the population register. For the field 
test in Germany, a three-stage sampling design was used, with a probability 
sample of communities and individuals in five selected federal 
states. The within-household sample included in the present study 
comprised 1,020 individuals completing the computer-based PI- 
AAC assessment. Of these, 520 were male (50.98%) and 458 
female (44.90%). For 42 participants, no gender information was 
available (4.12%). The average age was 39.40 years (SD = 13.30). 


Instrumentation 


Reading literacy. The PIAAC conceptual framework for 
reading literacy is based on conceptions of literacy from the 
International Adult Literacy Survey (IALS) conducted in the 
1990s and the Adult Literacy and Life Skills Survey (ALL) con- 
ducted in 2003 and 2006 (see OECD, 2009a). It was extended for 
PIAAC to cover reading skill in the information age by including 
skills of reading in digital environments. More than half of the 
reading tasks were taken from the former paper-based adult liter- 
acy assessments IALS and ALL to link PIAAC results back to 
these studies. New tasks simulating digital (hypertext) environ- 
ments were developed to cover the broadened construct including 
skills of reading digital texts. The tasks covered the cognitive 
operations “access and identify information,” “integrate and inter- 
pret information,” and “evaluate and reflect information” (see 
OECD, 2009a). The majority of tasks included print-based texts as 
used in previous studies (e.g., newspapers, magazines, books). 
Tasks representing the digital medium included, for instance, 
hypertext and environments such as message boards and chat 
rooms. Tasks also varied with respect to the context (e.g., 
work/occupation, education and training) and whether they in- 
cluded continuous texts (e.g., magazine articles), noncontinuous 
texts (e.g., tables, graphs), or both. 

In the PIAAC field test, 72 reading tasks were administered. For 
the present study, only those 49 tasks were used that entered the 
main study. To respond, participants were required to highlight 
text, to click a (graphical) element of the stimulus, to click a link, 
or to select a check box. As a sample, Figure 1 (upper panel) 
presents a screenshot from the first “Preschool Rules” task. Re- 
spondents were asked to answer the question shown on the left side 
of the screen by highlighting text in the list of preschool rules on 
the right side. The question was to figure out the latest time that 
children should arrive at preschool. Thus, readers were required to 
access and identify information, the context was personal, and 
print text was presented. 

Problem solving in technology-rich environments. This 
construct refers to using information and communication technol- 
ogy (ICT) to collect and evaluate information so as to communi- 
cate and perform practical tasks such as organizing a social activ- 
ity, deciding between alternative offers, or judging the risks of 
medical treatments (OECD, 2009b). The framework (OECD, 
2009b) defined multiple task characteristics that formed the basis 
for instrument development. The cognitive operations to be cov- 
ered by the tasks were goal setting and progress monitoring, 
planning and self-organizing, acquiring and evaluating informa- 
tion, and making use of information. The technology dimensions 
included hardware devices (e.g., desktop or laptop computers), 
software applications (e.g., file management, Web browser, 
e-mail, spreadsheet), various commands and functions (e.g., buttons, 
links, sort, find), and multiple representations (e.g., text, 
numbers, graphics). Moreover, task development aimed at the 
variation of the task’s purpose (e.g., personal, work/occupation), 
intrinsic complexity (e.g., the minimal number of actions required 
to solve the problem, the number of constraints to be satisfied), and 
the explicitness of the problem (implicit, explicit). 

Figure 1. Sample tasks: reading literacy task “Preschool Rules” (upper panel); problem solving in 
technology-rich environments task “Job Search” with only the start page showing the search engine 
results depicted, not the linked pages (lower panel). OECD = Organisation for Economic Co-Operation 
and Development; PIAAC = Programme for the International Assessment of Adult Competencies. 

As defined by the framework (OECD, 2009b), tasks were 
developed in such a way that they varied in the number of 
required cognitive operations (e.g., acquiring and evaluating 
information), the number and kind of actions that have to be 
taken to solve the task in a computer environment, the inclusion 
of unexpected outcomes or impasses, and the extent to which 
the tasks were open-ended. A more difficult task simulating 
real-life problem solving would require several cognitive oper- 
ations, multiple actions in different environments, unexpected 
outcomes, and the planning of multiple subgoals that may 
depend on each other. A corresponding sample task would be 
one in which the problem solver has to do a Web search on the 
Internet to access information, integrate and evaluate informa- 
tion from multiple online sources by using a spreadsheet, and 
then create a summary of the information to be presented at 
school by using presentation software. 

In the PIAAC field test, 24 problem solving tasks were admin- 
istered. Of these tasks, only 13 were selected for the main study. 
For the present study, all available tasks were considered to obtain 
more reliable results on the correlation of effects varying across 
tasks. After excluding tasks with poor discrimination and tasks for 
which no score could be derived, 18 tasks were left. In the context 
of international large-scale assessments, further tasks may be 
dropped, especially if they show differential item functioning 
across participating countries. However, as we only used national 
data and did not aim at comparing countries, there was no need to 
consider task-by-country interactions. To give a response in the 
simulated computer environments, participants were required to 
click buttons, menu items, or links, to select from drop-down 
menus, to drag and drop, and so on. 

As a sample, Figure 1 (lower panel) presents a screenshot 
from the task “Job Search.” Regarding cognitive operations, 
participants had to access and evaluate information and monitor 
criteria for constraint satisfaction within a simulated job search. 
Thus, the task’s purpose was occupational. Starting from a 
search engine results page, the task was to find all the sites that 
do not require users to register or pay a fee and to bookmark 
these sites. Regarding the explicitness of the problem, instruc- 
tions did not directly tell participants the number of sites they 
must locate, but evaluation criteria were clearly stated. To solve 
the task, single actions of evaluation had to be repeated for each 
website; for a target page, multiple constraints needed to be 
satisfied. Both characteristics determined intrinsic complexity. 
As regards software applications and related commands, the 
task was situated in a simulated Web environment that included 
tools and functionality similar to those found in real-life 
browser applications, that is, clickable links, back and forward 
buttons of the browser, and a bookmark manager that allowed 
one to create, view, and change bookmarks. The opening page 
presented the task description on the left side and the results of 
the Web search engine, that is, clickable links and brief infor- 
mation about the linked page, on the right side of the screen. 
From this search engine results page, participants had to access 
the hypertext documents connected via hyperlinks to locate and 
bookmark those websites that meet the search criteria. 


Design and Procedure 


A rotation design was used to form 21 booklets resulting in an 
effective sample size for reading literacy of 113 to 146 responses 
per task and for problem solving in technology-rich environments 
of 140 to 191 responses per task. 

Data were collected in computer-assisted personal interviews. 
Interviewers went to the participants’ households to conduct the 
interview in person. First, participants completed a background 
questionnaire, and then the interviewer handed the notebook to the 
participant for completion of the cognitive tasks. There was no 
global time limit, that is, participants could take as long as they 
needed. Participants only completed the computer-based tasks if 
they were sufficiently ICT literate, which was tested by ICT tasks 
requiring basic operations such as highlighting text by clicking and 
dragging. In case of insufficient ICT literacy, a paper-based 
assessment was administered. In the computer-based part, partic- 
ipants were randomly assigned to booklets including reading lit- 
eracy, numeracy, and problem solving tasks. For the present study, 
only data from the computer-based assessment of reading literacy 
and problem solving were included. 


Statistical Analyses 


Modeling approach. The generalized linear mixed model 
(GLMM) framework (e.g., Baayen, Davidson, & Bates, 2008; De 
Boeck et al., 2011; Doran, Bates, Bliese, & Dowling, 2007) was 
used to investigate the role of time on task in reading and problem 
solving (Hypotheses 1-3). A linear model consists of a component 
$\eta_{pi}$ representing a linear combination of predictors determining 
the probability of person p for solving task i correctly. The pre- 
dictors’ weights are called effects. Modeling mixed effects means 
to include both random effects and fixed effects. Fixed effects are 
constants across units or groups of a population (e.g., tasks, per- 
sons, classrooms), whereas random effects may vary across units 
or groups of a population (cf. Gelman, 2005). The generalized 
version of the linear mixed model also accommodates categorical 
response variables. In measurement models of item response the- 
ory, for instance, the effect of each item or task i on the probability 
of obtaining a correct response is typically estimated as a fixed 
effect representing the task’s difficulty or easiness. The effect of 
person p is usually modeled as random, that is, as an effect which 
may vary across persons and for which the variance is estimated. 
The variance of this random effect represents the variability of 
skill across persons. 

The GLMM incorporating both random effects, $b$, and fixed 
effects, $\beta$, can be formulated as follows: $\eta = X\beta + Zb$ (e.g., 
Doran et al., 2007). In this model, $X$ is a model matrix for 
predictors with fixed weights included in vector $\beta$, and $Z$ is a 
model matrix for predictors with random weights included in 
vector $b$. The distribution of the random effects is modeled as a 
multivariate normal distribution, $b \sim N(0, \Sigma)$, with $\Sigma$ as the 
covariance matrix of the random effects. The continuous linear 
component $\eta_{pi}$ is linked to the observed ordered categorical response 
$Y_{pi}$ (correct vs. incorrect) by transforming the expected value of the 
observed response, that is, the probability to obtain a correct 
response, $\pi_{pi}$. When using the log-transformed odds ratio 
(log-odds), the logit link function follows: 
$\eta_{pi} = \ln(\pi_{pi}/(1 - \pi_{pi}))$ (cf. De Boeck et al., 2011). 

In the present study, to address the research question of whether 
the strength of the time on task effect is correlated with the 
easiness of tasks, the effects of both persons and tasks were 
defined as random intercepts (cf. random person random item 
model; De Boeck, 2008). A fixed intercept, $\beta_0$, is estimated 
additionally, which is the same for all participants and tasks. 

A baseline Model M0 was obtained by specifying an item 
response model (1PL or Rasch model) with task and person as 
random intercepts and by adding the time on task as person-by-item 
predictor with a fixed effect $\beta_1$. Model M0 serves as parsimonious 
reference model that is compared with more complex 
models including further fixed and/or random effects: 
$\eta_{pi} = (\text{intercept } \beta_0) + (\text{individual skill } b_{0p}) + (\text{relative easiness } b_{0i}) + \beta_1 (\text{time on task } t_{pi})$. 
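
The structure of Model M0 and the logit link can be illustrated with a minimal sketch. The function names (`inv_logit`, `logit`, `eta_m0`) and all numeric values are hypothetical and chosen only for illustration; the sketch evaluates the linear predictor for given effects and does not estimate anything.

```python
import math


def inv_logit(eta: float) -> float:
    """Map a log-odds value eta to a probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-eta))


def logit(prob: float) -> float:
    """Map a probability to log-odds (the logit link)."""
    return math.log(prob / (1.0 - prob))


def eta_m0(beta0, skill_p, easiness_i, beta1, time_pi):
    """Linear predictor of baseline Model M0 for person p and task i."""
    return beta0 + skill_p + easiness_i + beta1 * time_pi


# Hypothetical values (assumptions, not estimates from the study).
eta = eta_m0(beta0=0.5, skill_p=0.3, easiness_i=-0.2, beta1=-0.4, time_pi=1.0)
prob_correct = inv_logit(eta)
print(round(eta, 3), round(prob_correct, 3))
```

In this sketch a negative fixed time on task weight lowers the log-odds as time increases, whereas a positive weight would raise them.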

In the following analyses, this model is systematically extended 
by adding further predictors. For example, the predictor (time on 
task $t_{pi}$) with the random weight $b_{1i}$ is added, providing the 
variance of the by-task adjustment $b_{1i}$ to the fixed time on task 
effect $\beta_1$. As the by-task adjustment, $b_{1i}$, and task easiness, $b_{0i}$, are 
tied to the same observational unit, that is, task $i$, their association 
is also estimated. This correlation can be used to test whether the 
strength of the time on task effect linearly depends on task difficulty 
(as claimed by Hypothesis 2). Figure 2 shows the path 
diagram of Model M1, which is Model M0 extended by the 
predictor (time on task $t_{pi}$) with a random weight across tasks, $b_{1i}$ 
(cf. the graphical representations of GLMMs by De Boeck & 
Wilson, 2004). In Model M1, there is a fixed time on task effect, 
$\beta_1$, representing the average time on task effect. However, it is 
adjusted by task by adding the weight $b_{1i}$, which allows the time 
on task effect to vary across tasks as indicated by subscript $i$. The 
other models under consideration can be derived in a similar 
fashion by adding random effects adjusting the time on task effect 
by cognitive operation (Model M2, cf. Hypothesis 2), by person 
(Model M3, cf. Hypothesis 3), or by task and person (Model M4, 
integrating Hypothesis 2 and Hypothesis 3). 
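
How a by-task adjustment changes the time on task effect, and how its association with task easiness can be read, may be sketched as follows. All task labels and values are hypothetical. Note also that in the GLMM the correlation is estimated as a model parameter; computing a Pearson correlation from per-task values, as done here, is only a descriptive stand-in.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


beta1 = 0.5  # hypothetical fixed (average) time on task effect

# Hypothetical per-task random effects: (relative easiness, slope adjustment).
task_effects = {
    "task_a": (1.2, -0.6),
    "task_b": (0.3, -0.1),
    "task_c": (-0.4, 0.2),
    "task_d": (-1.1, 0.5),
}

# Time on task effect by task: fixed part plus by-task adjustment.
by_task_slope = {task: beta1 + adj for task, (_, adj) in task_effects.items()}

easiness = [e for e, _ in task_effects.values()]
adjustment = [a for _, a in task_effects.values()]
r = pearson(easiness, adjustment)
print(by_task_slope, round(r, 3))
```

With these made-up values the easiest task ends up with a slightly negative time on task effect and the hardest task with a clearly positive one, which is the kind of pattern a negative correlation between easiness and adjustment describes.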

To clarify whether the introduction of further random compo- 
nents into the model significantly improves model fit, model 
comparison tests were conducted. For comparing nested models, 
the likelihood ratio (LR) test was used, which is appropriate for 
inference on random effects (Bolker et al., 2009). The test statistic, 
that is, twice the difference in the log-likelihoods, is approximately 
$\chi^2$ distributed with degrees of freedom equal to the number of 
extra parameters in the more complex model. The LR test is 
problematic when the null hypothesis implies the variance of a 
random effect to be zero; this means that the parameter value is on 
the boundary of the parameter space (boundary effect; cf. Baayen 
et al., 2008; Bolker et al., 2009; De Boeck et al., 2011). Using the 
chi-square reference distribution increases the risk of Type II 
errors; therefore, the LR test has to be considered as a conservative 
test for variance parameters. 
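
The LR test statistic and its chi-square reference probability can be sketched with the standard library alone. The helper names and the log-likelihood values below are hypothetical; the closed-form tail probabilities used here hold only for 1 and 2 degrees of freedom, and no boundary correction is applied.

```python
import math


def lr_statistic(loglik_simple: float, loglik_complex: float) -> float:
    """Twice the difference in log-likelihoods of two nested models."""
    return 2.0 * (loglik_complex - loglik_simple)


def chi2_upper_tail(stat: float, df: int) -> float:
    """Upper-tail chi-square probability (closed forms for df = 1 and 2)."""
    if stat < 0:
        raise ValueError("statistic must be non-negative")
    if df == 1:
        # P(chi2_1 > x) equals P(|Z| > sqrt(x)) for a standard normal Z.
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # A chi-square with 2 df is exponential with mean 2.
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only covers df = 1 or df = 2")


# Hypothetical log-likelihoods of a simpler and a more complex nested model.
stat = lr_statistic(loglik_simple=-1005.0, loglik_complex=-1000.0)
p_value = chi2_upper_tail(stat, df=2)
print(stat, p_value)
```

For more degrees of freedom, a statistics library would be needed in place of these closed forms.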

For the analysis at the task level (Hypothesis 4), logistic regres- 
sion was used to predict task success by the time taken on indi- 
vidual steps of the task completion sequence. 

Figure 2. Graphical representation of Model M1 showing how the probability to obtain a correct response, $\pi_{pi}$, 
is affected by a general intercept, $\beta_0$, the relative task easiness, $b_{0i}$, and individual skill, $b_{0p}$. Moreover, there is 
a time on task effect consisting of a fixed part, $\beta_1$, as well as random part, $b_{1i}$, which means that the time on task 
effect may vary across tasks $i$. 

Interpreting the effect of time on task in the GLMM. The 
“fundamental equation of RT modeling” (van der Linden, 2009, 
p. 259) assumes that the response time (RT; time on task) of person 
$p$ when completing task $i$ depends both on the person’s speed $\tau_p$ 
and the task’s time intensity $\lambda_i$. Accordingly, the expected value of 
the (log-transformed) response time can be defined as follows: 
$E(\ln(t_{pi})) = \lambda_i - \tau_p$ (cf. van der Linden, 2009). This implies that 
the effect of time on task reflects both the effect of the person and 
the task component. 
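
The decomposition of response time into a person and a task component can be sketched as follows, assuming the lognormal form in which the expected log time equals the task’s time intensity minus the person’s speed. The function names, the residual standard deviation, and all values are hypothetical.

```python
import math
import random


def expected_log_time(time_intensity_i: float, speed_p: float) -> float:
    """Expected log response time: task time intensity minus person speed."""
    return time_intensity_i - speed_p


def simulate_time(time_intensity_i, speed_p, residual_sd=0.3, rng=random):
    """Draw one response time (same unit as exp of the log scale)."""
    log_t = rng.gauss(expected_log_time(time_intensity_i, speed_p), residual_sd)
    return math.exp(log_t)


rng = random.Random(2024)  # fixed seed so the sketch is reproducible
# Same hypothetical task, a slower and a faster hypothetical person.
slow = simulate_time(time_intensity_i=4.0, speed_p=-0.5, rng=rng)
fast = simulate_time(time_intensity_i=4.0, speed_p=0.5, rng=rng)
print(expected_log_time(4.0, -0.5), expected_log_time(4.0, 0.5), slow, fast)
```

Holding the task fixed, differences in expected log time come from speed alone; holding the person fixed, they come from time intensity alone.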

When the effect of time on task is introduced as an overall fixed 
effect $\beta_1$, as in Model M0, this effect would reflect the association 
between time on task and the log-odds ratio of the expected 
response. This association could not be interpreted in a straightforward 
way, as it depends not only on the correlation between 
underlying person-level parameters, that is, skill and speed, but 
also on the correlation of corresponding item parameters, that is, 
difficulty and time intensity (see van der Linden, 2009).² However, 
when modeling the effect of time on task as an effect random 
across tasks (Hypothesis 1), groups of tasks supposed to be homogeneous 
(Hypothesis 2), or individuals (Hypothesis 3), the 
influences from the task and person levels can be disentangled. 

A time on task effect random across tasks is obtained by 
introducing the by-task adjustment $b_{1i}$ to the fixed time on task 
effect $\beta_1$. The time on task effect by task results as $\beta_1 + b_{1i}$. 
Thereby, time on task is turned into a person-level covariate 
varying between tasks. That is, given a particular task with certain 
time intensity, variation in time on task is only due to differences 
in persons’ speed (plus residual). This allows us to interpret time 
on task as a task-specific speed parameter predicting task success 
above and beyond individual skill. 

A by-person random time on task effect means to adjust the 
fixed time on task effect $\beta_1$ by the person-specific parameter $b_{1p}$, 
resulting in the time on task effect $\beta_1 + b_{1p}$. The fixed effect 
shows a constant as subscript, whereas the random effect is provided 
additionally with $p$ as subscript indicating that the effect may 
vary across persons $p$. Given a particular person working at a 
certain speed level, variation in time on task is only due to 
differences in the tasks’ time intensity (plus residual). This means 
that time on task can be conceived of as a task-level covariate that 
is specific to persons and predicts task success above and beyond 
task easiness. 

Trimming of time data. As a preparatory step for data anal- 
ysis, the (between-person) time on task distribution of each task 
was inspected for outliers. The middle part of a time on task 
distribution was assumed to include the observations that are most 
likely to come from the cognitive processes of interest. To exclude 
extreme outliers in time on task and to minimize their effect on 
analyses, observations more than two standard deviations above (below) the 
mean were replaced by the value at two standard deviations above 
(below) the mean. As even a single extreme outlier can consider- 
ably affect mean and standard deviation, time on task values were 
initially log-transformed, which means that extremely long time on 
task values were pulled to the middle of the distribution. With this 
trimming approach, 4.79% of the data points in reading literacy 
and 4.67% in problem solving were replaced. Transforming a 
covariate may have an impact on estimated parameters of the 
linear mixed model (for linear transformations, see, e.g., Morrell, 
Pearson, & Brant, 1997). Therefore, we also conducted the analyses 
without log-transforming the time on task variable. As we 
obtained the same result pattern, we report the analyses with log 
transformation only. Results obtained with the untransformed data 
are available from the first author upon request. 
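
The trimming step can be sketched as follows for the times of a single task. This is a minimal sketch under stated assumptions: the function name and the example times are hypothetical, and the sample standard deviation is used because the text does not say which variant was applied.

```python
import math
from statistics import mean, stdev


def trim_log_times(times_sec, n_sd=2.0):
    """Log-transform times, then replace values beyond n_sd standard
    deviations from the mean by the value at that boundary."""
    logs = [math.log(t) for t in times_sec]
    center, spread = mean(logs), stdev(logs)  # sample SD (an assumption)
    lower, upper = center - n_sd * spread, center + n_sd * spread
    trimmed = [min(max(x, lower), upper) for x in logs]
    n_replaced = sum(1 for x, y in zip(logs, trimmed) if x != y)
    return trimmed, n_replaced


# Hypothetical times on task (in seconds) for one task, with one extreme value.
times = [30, 32, 35, 36, 38, 40, 41, 43, 45, 900]
trimmed, n_replaced = trim_log_times(times)
print(n_replaced, round(max(trimmed), 3))
```

Values inside the two-standard-deviation band are left unchanged; only the extreme observation is pulled back to the upper boundary on the log scale.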


Statistical software. For estimating the presented GLMMs, 
the lmer function of the R package lme4 (Bates, Maechler, & 
Bolker, 2012) was applied. The R environment (R Core Team, 
2012) was also used to conduct logistic regression analyses. 


Results 


Difficulty of Tasks 


To compare the difficulty of problem solving tasks and reading 
literacy tasks, the baseline Model M0 was tested for both domains 
without the time on task effect. For reading literacy, an intercept of 
$\beta_0 = 0.61$ (z = 3.21, p < .01) was obtained; it represents the 
marginal log-odds for a correct response in a task of average 
easiness completed by a person of average skill; the corresponding 
probability was 64.68%. For problem solving, the result was 
$\beta_0 = -0.72$ (z = -2.37, p < .01), indicating that the probability of a 
correct response was on average only 32.68%, that is, problem 
solving tasks were much harder than reading literacy tasks. Figure 
3 shows the densities of the estimated task easiness parameters for 
reading literacy tasks (upper panel) and problem solving tasks 
(lower panel). Task easiness values were obtained by adding the 
intercept $\beta_0$ and the random task intercept (relative easiness $b_{0i}$). 
The proportion of correct responses, p, ranged for reading literacy 
from 12.41% to 96.92% and for problem solving from 11.86% to 
77.49%. 
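
The conversion from an intercept on the log-odds scale to a probability can be checked by hand with the inverse of the logit link. The worked values below use the rounded intercepts as reported, so they agree with the reported probabilities only approximately; the small differences are presumably due to rounding of the intercepts.

```latex
\pi = \frac{1}{1 + e^{-\beta_0}}, \qquad
\frac{1}{1 + e^{-0.61}} \approx 0.648 \quad \text{(reading literacy)}, \qquad
\frac{1}{1 + e^{0.72}} \approx 0.327 \quad \text{(problem solving)}.
```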


Time on Task Effect by Domain (Hypothesis 1) 


For testing Hypotheses 1 and 2, Model M0 was extended to 
Model M1 by adding the by-task random time on task effect $b_{1i}$: 
$\eta_{pi} = (\text{intercept } \beta_0) + (\text{individual skill } b_{0p}) + (\text{relative easiness } b_{0i}) + \beta_1 (\text{time on task } t_{pi}) + b_{1i} (\text{time on task } t_{pi})$. 

To address Hypothesis 1 regarding the time on task effect by 
domain, the fixed time on task effects $\beta_1$, as specified in Model 
M1 (see also Figure 2), were compared between reading literacy 
and problem solving. 

Reading literacy. Table 1 provides an overview of the results. 
For reading literacy, a negative and significant time on task effect 
of $\beta_1 = -0.61$ (z = -4.90, p < .001) was found. Thus, for a 
reading literacy task of average difficulty, correct responses were 
associated with shorter times on task, whereas incorrect responses 
were associated with longer times on task. 

Problem solving. For problem solving, a positive and significant 
time on task effect of $\beta_1 = 0.56$ (z = 2.30, p = .02) was 
estimated. Thus, for a problem solving task of average difficulty, 
correct responses were associated with longer times on task and 
vice versa. These findings give support to Hypothesis 1. 


Time on Task Effect by Task (Hypothesis 2) 


If the assumption holds that task difficulty moderates the time 
on task effect, a relation between task easiness and the strength of 
the time on task effect should be observable within a domain. To 
test Hypothesis 2, the variances of the by-task adjustments to the 
fixed time on task effects and their correlations with task easiness, 
as estimated through Model M1, were inspected for both domains 
under consideration. 

Figure 3. Distribution of estimated task easiness parameters for reading 
literacy (upper panel) and problem solving in technology-rich environments 
(lower panel). On average, reading literacy tasks were easier than 
problem solving tasks. 

² We thank an anonymous reviewer who advised us to consider this 
issue. 

Reading literacy. For reading literacy, the variability of the
by-task adjustment was estimated to be Var($b_{1i}$) = 0.55. This
means that for reading literacy, the time on task effect varied
across tasks. Most importantly, the by-task time on task effect and
intercept were negatively correlated, Cor($b_{0i}$, $b_{1i}$) = −.39. That is,
the overall negative time on task effect became even stronger in
easy tasks but was attenuated in difficult tasks. The upper left
panel in Figure 4 illustrates how the time on task effect in reading
literacy was adjusted by task. To test whether the model extension
improved the model’s goodness of fit, we compared the nested
Models M0 and M1. The difference test showed that Model M1
fitted the data significantly better than Model M0, $\chi^2$(2) = 77.65,
p < .001. To test whether the correlation parameter was actually
needed to improve model fit, that is, to test the significance of the
correlation, Model M1 was compared to a restricted version
(Model M1r), which did not assume a correlation between by-task
time on task effect and by-task intercept. The model difference test
suggested that the unrestricted version of Model M1 had a better fit
to the data than the restricted version, $\chi^2$(1) = 5.16, p = .02. Thus,
the negative correlation between the by-task adjustment of the time
on task effect and the random task intercept (i.e., task easiness)
was also significant.

Problem solving. For problem solving, the variance of the
by-task adjustment to the fixed effect of time on task was estimated
as Var($b_{1i}$) = 0.89. Thus, for problem solving in
technology-rich environments, the time on task effect varied across
tasks. The correlation between the by-task adjustment to the time
on task effect and task easiness was negative as for reading
literacy, Cor($b_{0i}$, $b_{1i}$) = −.61. That is, the overall positive time on
task effect became even stronger in hard-to-solve tasks but was
attenuated in easy-to-solve tasks. Figure 4 (upper right panel)
illustrates how the time on task effect in problem solving was
adjusted by task. The model difference test, comparing the nested
Models M0 and M1, clearly showed that adding the random time
on task effect in Model M1 improved the model fit, $\chi^2$(2) = 73.99,
p < .001. Moreover, comparing Model M1 with a restricted
version (Model M1r) without a correlation between the by-task
time on task effect and the random task intercept revealed that the
correlation was significant, $\chi^2$(1) = 6.50, p = .01.

Taken together, these results give clear support to Hypothesis 2. In
a domain where task solution cannot rely on automatic processes 
such as problem solving, the already positive time on task effect 
was substantially increased in tasks that were especially difficult. 
In a domain where rapid automatic processing can account for a 
substantial part of the task solution process such as reading, an 
already negative time on task effect became even stronger in easier 
tasks but diminished in more difficult tasks. 
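The model difference tests for Hypothesis 2 can be retraced with a few lines of code. The sketch below assumes that they are likelihood-ratio tests referred to a chi-square distribution with the stated degrees of freedom. It uses the closed-form upper-tail probabilities that exist for 1 and 2 degrees of freedom, so only the standard library is needed; the helper name `chi2_sf` is illustrative, and the statistics are those given in the text and in Table 1.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable (closed forms for df = 1 and 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 and df = 2 are implemented in this sketch")

# (label, chi-square statistic, df) as reported for Models M0, M1, and M1r
tests = [
    ("reading, M1 vs. M0", 77.65, 2),
    ("reading, M1 vs. M1r", 5.16, 1),
    ("problem solving, M1 vs. M0", 73.99, 2),
    ("problem solving, M1 vs. M1r", 6.50, 1),
]
p_values = {label: chi2_sf(x, df) for label, x, df in tests}

for label, p in p_values.items():
    print(f"{label}: p = {p:.3g}")
```

The resulting p values round to the levels reported for the two restricted-model comparisons (.02 and .01) and fall well below .001 for the comparisons with the baseline model.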


Time on Task Effect by Cognitive Operation 


An alternative explanation for the variability of the time on task 
effect between tasks refers to differences in the required cognitive 
operations. That is, tasks being homogeneous with respect to 
cognitive operations would show similar time on task effects. To 
test whether the presence of different cognitive operations as 
detailed by the respective frameworks affects the time on task 
effect, we extended Model M0 to the following Model M2 by
introducing the cognitive operation c required in a task as a
categorical task-level predictor and as a factor moderating the time
on task effect, which is represented by the random weight $b_{1c}$:
$\eta_{pi} = (\text{intercept } \beta_0) + (\text{individual skill } b_{0p}) + (\text{relative easiness } b_{0i}) + \beta_1\,(\text{time on task } t_{pi}) + (\text{cognitive operation } b_{0c}) + b_{1c}\,(\text{time on task } t_{pi})$.

Table 1
Overview of Main Model Parameters on the Time on Task Effect and Model Comparison Tests

Domain | Research question/hypothesis | Model | Time on task effect random across | $\chi^2$ of model difference test (df in parentheses) | Fixed effect $\beta_1$ | Variance of random effect | Correlation of random effects
Reading literacy | Baseline model | M0 | | | −0.55*** | – | –
 | Testing Hypotheses 1 and 2: Time on task effect by domain and task | M1 | Tasks | | [−0.61] | 0.55 | −.39
 | Comparison with baseline model | M1 vs. M0 | | [77.65 (2)] | | |
 | Restricted model without random effect correlation | M1r | Tasks | | −0.59*** | 0.54 | –
 | Comparison with unrestricted model | M1 vs. M1r | | 5.16 (1)* | | |
 | Exploring the time on task effect by cognitive operation | M2 | Cognitive operations | | [?] | 0.003 | −1.00
 | Restricted model without random time on task effect across cognitive operations | M2r | | | [?] | – | –
 | Comparison with unrestricted model | M2 vs. M2r | | 0.79 (1), ns | | |
 | Testing Hypothesis 3: Time on task effect by person | M3 | Persons | | −0.65*** | 0.14 | [−.65]
 | Comparison with baseline model | M3 vs. M0 | | 15.09 (2)** | | |
 | Restricted model without random effect correlation | M3r | Persons | | [?] | 0.09 | –
 | Comparison with unrestricted model | M3 vs. M3r | | [?] | | |
 | Integrated model: Time on task effect by task and person | M4 | Tasks | | [−0.69] | 0.64 | −.52
 | | | Persons | | | 0.23 | [−.78]
 | Comparison with baseline model | M4 vs. M0 | | 106.14 (4)*** | | |
Problem solving | Baseline model | M0 | | | 0.49[?] | – | –
 | Testing Hypotheses 1 and 2: Time on task effect by domain and task | M1 | Tasks | | 0.56* | 0.89 | [−.61]
 | Comparison with baseline model | M1 vs. M0 | | 73.99 (2)*** | | |
 | Restricted model without random effect correlation | M1r | Tasks | | 0.54* | 0.87 | –
 | Comparison with unrestricted model | M1 vs. M1r | | 6.50 (1)* | | |
 | Testing Hypothesis 3: Time on task effect by person | M3 | Persons | | [?] | 0.22 | −.79
 | Comparison with baseline model | M3 vs. M0 | | 5.98 (2)[?] | | |
 | Restricted model without random effect correlation | M3r | Persons | | 0.49*** | 0.12 | –
 | Comparison with unrestricted model | M3 vs. M3r | | 5.98 (1)* | | |
 | Integrated model: Time on task effect by task and person | M4 | Tasks | | 0.56* | 0.89 | [−.63]
 | | | Persons | | | [0.11] | [−.76]
 | Comparison with baseline model | M4 vs. M0 | | 76.77 (4)*** | | |

Note. Dashes indicate that a parameter was not included in the model. M = Model; r = restricted. ns = not significant. Entries in square brackets are illegible in the source table and are given as reported in the text; [?] marks an entry or significance marker that could not be recovered, as is the case for the legend of the asterisks.

Reading literacy. For reading literacy, the PIAAC framework
assumes three broad aspects of cognitive operation: access and
identify information, integrate and interpret information, and
evaluate and reflect information. In a first step, we tested an
explanatory item response model with random person and task effects as
well as the effect of cognitive operation. For the three aspects of
cognitive operations, intercepts of 1.07 (z = 4.72, p < .01),
0.00 (z = 0.00, p = 1.00), and 0.08 (z = 0.19, p = .85) were
estimated. The probabilities of a correct response corresponding to
these intercepts were 74.50%, 50.01%, and 51.96%. Access tasks
were thus relatively easy, whereas integrate and evaluate tasks
showed about the same medium level of difficulty. By introducing
cognitive operation as an explanatory variable of task easiness, the
variance of task easiness, Var($b_{0i}$), decreased from 1.52 to 1.24,
which corresponds to $R^2$ = .20.

To investigate whether the influence of time on task on task
success varies across cognitive operations, Model M2 was tested.
The obtained variance of the by-cognitive operation adjustment to
the time on task effect was only Var($b_{1c}$) = 0.003. Moreover, the
correlation with the corresponding intercept was Cor($b_{0c}$, $b_{1c}$) =
−1.00, indicating overparameterization of the model. Model M2
was compared with a restricted model including no time effect
varying across cognitive operations (Model M2r); there was no
significant improvement of model fit, $\chi^2$(2) = 0.79, p = .67. Thus,
the time on task effect did not vary across cognitive operations.
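
The probabilities quoted for the three aspects of cognitive operation follow from the intercepts through the inverse logit transformation. The sketch below assumes the logit link that is customary for explanatory item response models; the names `inv_logit` and `intercepts` are illustrative.

```python
import math

def inv_logit(eta: float) -> float:
    """Map a value on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Rounded intercepts for the three aspects of cognitive operation (from the text)
intercepts = {"access": 1.07, "integrate": 0.00, "evaluate": 0.08}
probabilities = {name: inv_logit(eta) for name, eta in intercepts.items()}

for name, p in probabilities.items():
    print(f"{name}: {100 * p:.1f}%")
```

Because the intercepts are rounded to two decimals, the results can differ in the second decimal from the percentages reported in the text.
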

Problem solving. The time on task effect was not further
investigated with respect to cognitive operations for two reasons.
First, there was only a small set of 18 tasks available. Second, each
of the problem solving tasks explicitly included multiple cognitive
operations from a set of four dimensions, that is, goal setting and
progress monitoring, planning and self-organizing, acquiring and
evaluating information, and making use of information, as defined
by the PIAAC assessment framework (OECD, 2009b, p. 10).
Given the constraints of a large-scale assessment, PIAAC only
aimed at an overall indicator of problem solving. Our analyses
would require a more fine-grained measure with a broad set of
indicators for the various underlying cognitive operations.
Although, for each task, one operation is assumed to be dominant,
other operations might also be involved. For instance, the PIAAC
framework maps the sample task “Job Search” to the cognitive
operations of accessing and evaluating information as well as
monitoring criteria for constraint satisfaction. There were only two
more tasks that showed a comparable set of assumed cognitive
operations, whereas in other tasks the requirement of accessing
information was combined with a different additional demand, for
example, communicating information. Thus, it was not possible to
form subgroups with a sufficient number of tasks being homogeneous
in the assumed composition of required cognitive operations.

Figure 4. Upper row: Time on task effect by task for reading literacy (left panel) and problem solving in
technology-rich environments (right panel). The solid line indicates the fixed time on task effect; the dots show
how it is adjusted by task. For difficult tasks, the time on task effect gets more positive, whereas it gets more
negative for easy tasks. Lower row: Time on task effect by person for reading literacy (left panel) and problem
solving in technology-rich environments (right panel). The solid line indicates the fixed time on task effect; the
dots show how it is adjusted by person. For less able individuals, the time on task effect gets more positive,
whereas for able persons, it gets more negative.


Time on Task Effect by Person (Hypothesis 3) 


On the person level, we assumed that the effect of time on task
varies across the individual skill level. To test Hypothesis 3, we
extended Model M0 to Model M3 by adding a random time on task
effect, $b_{1p}$, representing the variation across individuals:
$\eta_{pi} = (\text{intercept } \beta_0) + (\text{individual skill } b_{0p}) + (\text{relative easiness } b_{0i}) + \beta_1\,(\text{time on task } t_{pi}) + b_{1p}\,(\text{time on task } t_{pi})$.

Reading literacy. For reading literacy, the variance of the
by-person adjustment was Var($b_{1p}$) = 0.14. Thus, for reading
literacy, the time on task effect varied across persons. Most
importantly, a correlation between the by-person time on task effect
and by-person intercept of Cor($b_{0p}$, $b_{1p}$) = −.65 was estimated.
That is, the overall negative time on task effect became stronger in
able readers but was attenuated in poor readers. The bottom left
panel in Figure 4 illustrates how the time on task effect adjusted by
person linearly decreases in more able persons. To clarify whether
the more liberal Model M3 fitted the data better, we compared the nested
Models M0 and M3. The model difference test revealed that Model
M3 fitted the data significantly better than Model M0, $\chi^2$(2) =
15.09, p < .01. To test whether the correlation parameter is
required to improve model fit, that is, to test the significance of the
correlation, Model M3 was compared with a restricted version
(Model M3r) without the correlation between by-person time on
task effect and intercept. The model difference test revealed that
Model M3 without restrictions was the better fitting model
(the reported $\chi^2$(1) statistic and p value are not legible in the source).

Problem solving. Similar results were obtained for problem
solving. The variance of the by-person adjustment to the fixed
effect of time on task was Var($b_{1p}$) = 0.22. Thus, for problem
solving in technology-rich environments, the time on task effect
varied across persons. The correlation between the by-person
adjustment of the time on task effect and the by-person intercept
(individual skill) was again negative and substantial, Cor($b_{0p}$, $b_{1p}$) =
−.79. That is, the overall positive time on task effect became even
stronger in poor problem solvers but was attenuated in able problem
solvers (see the bottom right panel in Figure 4). The difference
test comparing Model M3 including the random time on task effect
with the baseline Model M0 was almost significant, $\chi^2$(2) = 5.98,
p = .05. Finally, comparing Model M3 with a restricted version
(Model M3r) without a correlation between by-person time on task
effect and intercept revealed that the correlation was significant,
$\chi^2$(1) = 5.98, p = .01.


Integrated Model: Time on Task Effect by Task 
and Person 


As assumed in Hypotheses 2 and 3, the previous results indicate
that task difficulty and individual skill level have an influence on
the strength and direction of the time on task effect. The final
Model M4 integrates both the by-task and the by-person adjustments
to the time on task effect. The results found for Models M1
and M3 were perfectly reproduced in the following Model M4:
$\eta_{pi} = (\text{intercept } \beta_0) + (\text{individual skill } b_{0p}) + (\text{relative easiness } b_{0i}) + \beta_1\,(\text{time on task } t_{pi}) + b_{1i}\,(\text{time on task } t_{pi}) + b_{1p}\,(\text{time on task } t_{pi})$.

Reading literacy. For reading literacy, the time on task effect
was estimated to be $\beta_1$ = −0.69 (z = −5.16, p < .01). The variance
of the by-task adjustment to the time on task effect was Var($b_{1i}$) = 0.64,
and that of the by-person adjustment was Var($b_{1p}$) = 0.23, that is, the
time on task effect varied across both reading tasks and readers.
Moreover, the time on task effect varied systematically in that the
adjustments were linearly related to task easiness and individual skill
level, respectively, as expected. The correlation between easiness of
reading tasks and by-task adjustment was Cor($b_{0i}$, $b_{1i}$) = −.52, and the
correlation between individual skill and by-person adjustment was
Cor($b_{0p}$, $b_{1p}$) = −.78. The difference test showed that Model M4 fit
the data significantly better than Model M0, $\chi^2$(4) = 106.14, p < .001.

The curves in Figure 5 (upper panel) indicate how for a given 
reader and reading task the probability for a correct response 
depends on time on task. The range of the time on task axis 
represents the empirical range of time on task in the selected tasks. 
The slope of the curves resulted from adding up the time on task 
effect and the adjustments to the time on task effect by task and by 
person. When considering a proficient reader (skill level of $b_{0p}$ =
1.61) and an easy reading task (easiness of $b_{0i}$ = 1.89), that is, a
reading situation of low demand, the unadjusted negative effect of
−0.69 became much stronger, resulting in a negative time on task
effect of −1.90 (plus line). However, in a situation of high demand,
where a difficult reading task (easiness of $b_{0i}$ = −0.77) was
completed by a poor reader (skill level of $b_{0p}$ = −1.79), the
curve’s slope was no longer negative but even slightly positive,
that is, 0.55 (triangle line). In situations of medium demand, that is,
a poor reader completing an easy task or an able reader completing 
a difficult task, the curves’ slopes are in-between. 
Figure 5. Time on task effect by task and skill level for reading literacy
(upper panel) and problem solving in technology-rich environments (lower 
panel). For combinations of two tasks (easy vs. hard) with two persons 
(less able vs. able), the probability of obtaining a correct response is plotted 
as a function of time on task. 


Problem solving. In the integrated model, a positive time on
task effect of $\beta_1$ = 0.56 (z = 2.26, p = .02) was obtained. The
variance of the by-task adjustment to the time on task effect was
Var($b_{1i}$) = 0.89, and that of the by-person adjustment was Var($b_{1p}$) =
0.11. The correlation between easiness of problem solving tasks
and the by-task adjustment to the time on task effect was
Cor($b_{0i}$, $b_{1i}$) = −.63, and the correlation between individual
skill level and the by-person adjustment to the time on task effect
was Cor($b_{0p}$, $b_{1p}$) = −.76. Again, the model comparison test
indicated that Model M4 fit the data significantly better than Model
M0, $\chi^2$(4) = 76.77, p < .001.

The bottom panel in Figure 5 shows the probability of obtaining 
a correct response as a function of the time on task for two selected 
tasks completed by two selected persons. In a situation of low 
demand, that is, a proficient problem solver (skill level of $b_{0p}$ =
2.63) completing an easy task (easiness of $b_{0i}$ = −0.67), the time
on task effect decreased dramatically, even turned negative,
and was estimated as −0.62 (plus line in Figure 5). However, in the
situation of high demand where a difficult task (easiness of $b_{0i}$ =
−3.44) was completed by a poor problem solver (skill level of $b_{0p}$ =
−1.66), the positive time on task effect of 0.56 became much
stronger and was estimated as 1.69 (triangle line in Figure 5). If the
demand is medium, that is, a less able person completes an easy 
task or an able person completes a difficult task, the curves’ slopes 
are in-between. 
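
The slopes quoted for the low-demand and high-demand situations can be read as the fixed effect plus the by-task and by-person adjustments of Model M4. The sketch below backs out that combined adjustment from the reported numbers and shows how a slope translates into a response curve. It assumes a logit link. The `base_logit` argument (the part of the linear predictor that does not depend on time) is a hypothetical input, because neither the Model M4 intercepts nor the scaling of the time on task variable are given in this section.

```python
import math

FIXED = {"reading": -0.69, "problem solving": 0.56}   # beta_1 in Model M4
# Person-task specific slopes reported in the text (low vs. high demand)
SLOPES = {
    ("reading", "low demand"): -1.90,
    ("reading", "high demand"): 0.55,
    ("problem solving", "low demand"): -0.62,
    ("problem solving", "high demand"): 1.69,
}

# Combined adjustment b_1i + b_1p implied by slope = beta_1 + b_1i + b_1p
adjustments = {key: round(slope - FIXED[key[0]], 2) for key, slope in SLOPES.items()}

def p_correct(base_logit: float, slope: float, time_on_task: float) -> float:
    """Probability of a correct response for a given slope (logit link assumed)."""
    return 1.0 / (1.0 + math.exp(-(base_logit + slope * time_on_task)))

for key, adj in adjustments.items():
    print(key, "combined adjustment:", adj)
```

In both domains the combined adjustment is negative in the low-demand situation and positive in the high-demand situation, and of similar size, which is the pattern the text describes.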

Taken together, these results indicate that positive time on task 
effects are observed especially in highly demanding situations, 
where not-so-skilled readers or problem solvers are confronted 
with a difficult task. Presumably, they can partly compensate for 
task demands by allocating cognitive resources. If this interpreta- 
tion holds true, differential time on task effects should be observ- 
able on a within-task level as well. Specifically, if it is the strategic 
allocation of processing time that drives a positive time on task 
effect in problem solving tasks and difficult reading tasks being 
encountered by poor readers, on a within-task level the positive 
time on task effect should be confined to the processing of task- 
relevant parts of the stimulus. We tested this hypothesis as a last 
step. 


Decomposing the Time on Task Effect at the Task 
Level (Hypothesis 4) 


Using fine-grained time information extracted from log files, we 
decomposed the global time on task into several components that 
reflect particular steps of task solution. This was done at the task 
level for the problem solving task “Job Search,” which required 
screening a search engine results page (see Figure 1, lower panel) 
and visiting multiple linked Web pages. Two of five Web pages in 
this task meet the criteria specified in the instruction and have to 
be bookmarked to obtain a correct response. In Hypothesis 4, 
spending more time on the two target pages was expected to 
indicate strategic behavior associated with a higher probability of 
successful task completion. In contrast, a negative effect was 
assumed for spending time on the search engine results page, 
which did not provide any hints about the target pages. For the 
time spent on nontarget pages, a negative effect was also expected. 

First, logistic regression was used to predict the task success by
time on task. The sample size for this analysis was 182. This
analysis revealed a nonsignificant time on task effect of −0.29
(z = −0.59, p = .55). As a second step, task success was predicted
by the time spent on the search engine results page, the time spent
on the two relevant Web pages, and the time spent on the three
irrelevant Web pages. The obtained effect for time spent on the
relevant Web pages was positive and significant as expected, 0.96
(z = 2.53, p = .01), that is, spending more time on the target pages
for evaluating the accessed information and monitoring the multiple
criteria for constraint satisfaction was associated with a
higher probability of achieving a correct response. In contrast, for
the time spent on the search engine results page, a significant
negative effect of −1.78 was revealed (z = −2.97, p < .01). The
time spent on irrelevant Web pages was not significantly related to
task success (estimated effect of 0.13, z = 0.23, p = .82). As a
measure of effect size, we computed Nagelkerke’s $R^2$, which was
.25, that is, about a quarter of the response variability could be
explained by the component time predictors. This result pattern
suggests that successful problem solvers quickly discarded the
irrelevant search engine results page, whereas relevant pages meeting
evaluation criteria were checked carefully. This pattern is fully
compatible with the view that positive time on task effects in
difficult tasks are due to a strategic allocation of cognitive
resources, as already suggested by the moderation of the time on task
effect by domain, task difficulty, and skill level.
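
To make the size of the three component effects easier to read, the logistic regression coefficients can be expressed as odds ratios. The sketch below only exponentiates the coefficients reported above. What one unit of each time predictor means is not stated in this section (the times may have been transformed or standardized), so the odds ratios are per unit of whatever scale was used.

```python
import math

# Logistic regression coefficients for the "Job Search" task (from the text)
coefficients = {
    "search engine results page": -1.78,
    "relevant (target) pages": 0.96,
    "irrelevant (nontarget) pages": 0.13,
}

# exp(coefficient) = multiplicative change in the odds of a correct response
# per one-unit increase in the predictor, holding the other predictors fixed
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

An odds ratio above 1 (target pages) means that more time goes with higher odds of success, and a ratio below 1 (search engine results page) means lower odds.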


Discussion 


Computer-based assessment provides new possibilities to assess 
cognitive skills and underlying processes by measuring not only 
the outcome of a task but also behavioral process data that might 
be interpreted in terms of cognitive processes happening through- 
out task completion. This means that to some degree, data from 
computer-based assessments may be used to address research 
questions through means of process analysis that were previously 
confined to experimental research. This is of interest especially in 
combination with the rather large sample sizes obtained in educational
assessments (compared to lab experiments). Thus, while
there used to be a tradeoff (either go with small samples and deep
process analysis or have large samples and test result data only),
this tradeoff can be remedied to some degree by using process data
from large-scale assessments.

The goal of this study was to investigate the effect of time on 
task on task success in reading literacy and problem solving in 
technology-rich environments and to test potential moderating 
variables. Our central hypothesis was that the relative degree of 
strategic versus routine cognitive processing as required by a task, 
as well as the test taker’s acquired skill, determines the strength 
and direction of the time on task effect. Accordingly, our results 
revealed that the time on task effect was moderated by domain, 
task difficulty, and individual skill. 


Time on Task Effects in Reading Literacy 


For reading literacy, overall, a negative time on task effect was 
found, that is, brief times on task were associated with correct 
responses, and taking more time apparently was not related to 
greater task success. Very slow respondents thus tended to fail on the task.
This observation especially concerns easy reading tasks as shown 
by the negative correlation between task easiness and the task- 
specific time on task effect, which means that for easy tasks the 
time on task effect was more negative than for difficult ones. To 
put it simply, in very easy tasks, the correct solution was either 
obtained quickly or never. In contrast, for difficult reading tasks, 
this association got weaker and in some instances was reversed. 
Taking individual differences in reading skill into account, these 
findings were consistently extended, that is, with increasing read- 
ing skill, the time on task effect got more negative, whereas it got 
weaker or even positive with decreasing reading skill. Thus, for 
poor readers completing hard reading tasks, time on task showed 
a positive effect, whereas for proficient readers working on easy 
tasks, a very strong negative effect was found. The latter result 
means that the few proficient readers who did not master the easy 
reading tasks took more time than the majority of proficient 
readers who were successful. In contrast, in a group of less skilled 
readers, this time difference between correct and incorrect answers 
in the same tasks was less pronounced, as shown by the weaker 
negative time on task effect. 

The observed result pattern that incorrect responses are associ- 
ated with longer times on task has consistently been found for 
other untimed performance measures as well, for instance, general
knowledge tasks (Ebel, 1953), matrices tasks (Hornke, 2000), 
figure series, number series, verbal analogy tasks (Beckmann, 
2000), verbal memory tasks (Hornke, 2005), and discrimination 
tasks (for a review of reaction time research on this matter, see 
Luce, 1986). Hornke (2005) discussed how correct responses with 
short latencies are eye-catching. Incorrect responses in contrast 
may be preceded by an ongoing process of rumination and ulti- 
mately a switch to random guessing. This interpretation is consis- 
tent with our finding in that, especially for easy tasks, there is a 
strong negative time on task effect and also explains why, in easy 
reading tasks, generally skilled readers had a lower chance of 
getting the task correct when the response took longer. Similar 
effects were reported by Hornke (2000) and Beckmann (2000). 

Across the cognitive operations required in reading tasks, there 
was no significant variation of the time on task effect. Thus, 
differences in the time on task effect across tasks cannot be 
ascribed to the presence or absence of specific cognitive operations 
as outlined in the PIAAC framework. In line with our findings on
the dependency of the time on task effect on task difficulty, the 
clusters of access, integrate, and evaluate tasks are not very well 
distinguishable by their level of difficulty. Task features other than
the cognitive operations are hence responsible for the variation of
the time on task effect with task difficulty. If our cognitive inter- 
pretation of time on task effects holds, it might be worthwhile to 
look for task features that drive task difficulty and differential time 
on task effects. Identifying these features might further contribute 
to clarifying the PIAAC reading tasks’ demands in cognitive terms 
and as such help to further advance the assessment framework.
Therefore, as one future step, we intend to classify the
PIAAC tasks, for instance, in terms of the transparency of the 
information, or the degree of complexity in making inferences (cf. 
OECD, 2009b). Task features such as these are not yet entirely 
covered by the aspects detailed by the PIAAC assessment frame- 
work. 


Time on Task Effects in Problem Solving 


For problem solving, overall, a significant positive time on task 
effect was found: Long times on task were associated with correct 
answers and short times on task with wrong answers. Similar to 
reading, the time on task effect varied significantly across tasks. 
For easy tasks, it was weaker and around zero, whereas for 
difficult tasks, it became even more positive. This means that when 
dealing with challenging problems, spending more time was asso- 
ciated with higher probability of giving a correct response. Across 
individuals, poor problem solvers could benefit more from spend- 
ing more time on a task than strong problem solvers. Although 
causal interpretations are not possible, this result suggests that poor 
problem solvers can compensate for their lack of general skill by 
putting in more effort when working on a particular task, espe- 
cially when this task is hard to solve. Thus, the difference in time 
on task between correct and incorrect solutions was greater for 
weak problem solvers than for strong problem solvers, which is the 
reverse of the finding for reading. 

The results on the time on task effect for reading literacy and
problem solving show that the moderating roles of task difficulty
and a person’s skill are similar for both domains, even though the
overall effect is very different. The time on task effect may become
similar between the two domains when considering the extreme
cases in which a skilled person encounters an easy task or a less 
skilled person engages in a difficult task. In the first case, the 
resulting time on task effect is negative (even for problem solving), 
and in the second case, it is positive (even for reading literacy). 
Thus, across domains the strength and the direction of the time on 
task effects seem to be governed by skill and difficulty in the same 
way. Both high skill levels and easy tasks presumably are associ- 
ated with a large proportion of cognitive component processes that 
are apt to automatization (in easy tasks) or in fact automatized (in 
skilled persons), bringing about a negative time on task effect. In 
contrast, low skill levels and difficult tasks presumably are asso- 
ciated with the need to engage in controlled and thus time- 
consuming cognitive activity to a large extent, bringing about a 
positive time on task effect. 

Thus, on the one hand, problem solving and reading are conceived 
as involving different cognitive processes, and overall the relation of 
time on task to task success also clearly differs between the two 
domains. On the other hand, our results support the notion that 
combinations of tasks and persons form a continuum across the two 
domains ranging from automatic processing to controlled processing. 
Practicing a task may move a person-task combination to automatic
processing. However, this is limited by the nature of the task. For 
instance, certain aspects of a problem solving task may become 
automated in skilled individuals, but not core aspects of problem 
solving, such as inducing rules or drawing conclusions. 

Our interpretations of the time on task effect are further backed by 
the in-depth analysis of the time-taking behavior in the sample prob- 
lem solving task “Job Search.” This analysis was based on time data 
that was assumed to reflect different steps of task solution and pre- 
sumably information processing. It revealed that only for time spent 
on steps that are necessarily needed to solve the tasks, that is, to visit 
and evaluate the target pages for multiple criteria, a positive time on 
task effect emerged, whereas for spending time on the noninformative 
search engine results page and the nontarget pages, negative or null 
effects were found. When spending time on the target pages, the 
problem solver is assumed to deal with the part of the problem space 
that enables one to move step by step to the knowledge state that 
includes the solution (Simon & Newell, 1971) or to integrate relevant 
information, rather than identifying various other aspects of the prob- 
lem (Wirth & Leutner, 2008). Thus, this finding supported our hy- 
pothesis that the positive time on task effect in problem solving tasks 
reflects the need for and the benefit from devoting time to strategic 
and controlled cognitive processing. This interpretation suggests that 
task success could depend on the time spent on relevant pages 
(however, time on task as well as task success might also be driven by 
a common cause such as motivation). The negative effect of time 
spent on the search engine results page may indicate the strategy to 
select Web pages based on the limited information provided there. 
Although this approach could in principle be useful to filter search
results, in the given task the results page did not indicate whether 
search criteria would be met or not. Thus, lingering on a page that 
could not contribute to solving the task was in fact detrimental to 
succeeding. 


Time on Task and a Dual Processing Framework 


We derived our hypotheses on differential time on task effects 
both between and within domains by means of applying a dual
processing framework (cf. Fitts & Posner, 1967; Schneider &
Chein, 2003; Schneider & Shiffrin, 1977) to reading and problem 
solving tasks used in the PIAAC study. The hypotheses thus 
derived were confirmed; hence, our results are consistent with the 
notion that positive time on task effects reflect the strategic allo- 
cation of cognitive resources, whereas negative time on task ef- 
fects reflect the degree of automatization. Although the findings 
are entirely consistent with the predictions derived from such a 
framework and further backed by analyses on a within-task level, 
this interpretation has to remain somewhat speculative for the time 
being. The information that can be gained from large-scale 
computer-based assessments (although providing much more in- 
formation than traditional paper-and-pencil based assessments) is 
still limited. Usually, the information stored in log files is ambig- 
uous as to its interpretation in cognitive terms. In this article, we 
have assumed that taking more time on more difficult tasks indi- 
cates engaged cognitive processing. Other interpretations of the 
pattern of results are nevertheless conceivable. For instance, it might be the
case that time on task effects also reflect differences in motivation, 
that is, test takers not only take more time to think about a task but 
also think harder—resulting in a confounding between depth of 
processing and time taken. Related to that, Guthrie et al. (2004) 
considered time on task as an indicator of engagement, which 
means to read a text attentively, concentrating on the meaning, and 
with sustained cognitive effort (see also Kupiainen et al., 2014). 
Issues such as these can only be resolved by combining the 
analysis of large-scale process data with research tools allowing 
for an even more fine-grained analysis of cognitive processes, such 
as eye movements or think-aloud techniques (see Rouet & Passer- 
ault, 1999). As a consequence, we aim at corroborating our results 
through experimental studies that combine actual large-scale test- 
ing materials and still more fine-grained assessments of cognitive 
processes in the future. 


Limitations 


In the present study, test takers were free to adapt their
speed-accuracy compromise both within and between tasks, which has
consequences for the interpretation of the obtained results. As the
speed level of test takers was not controlled, the obtained variation 
in the association between time on task and task success across 
tasks may be due to different task difficulties as claimed in 
Hypothesis 2 or due to within-person differences in the selected 
speed level across tasks. However, the latter explanation does not 
seem plausible as there is empirical evidence for the assumption of 
stationarity of speed when completing power tests (cf., e.g., Gold- 
hammer & Klein Entink, 2011; Klein Entink et al., 2009). Station- 
arity of speed is also implied by the fixed level of accuracy which 
is a standard assumption in item response models (cf. van der 
Linden, 2007). 

As we did not manipulate the speed level of test takers exper- 
imentally, we cannot conclude that the predictor time on task has 
any causal effect on task success, which, however, is suggested by 
the positive time on task effect in those tasks requiring a higher 
level of controlled processing. In contrast, in tasks that can be 
completed more automatically and for which a negative effect was 
revealed, time on task should rather be conceived of as an indicator 
of competence in addition to the task result. 


As another limitation, the sample size of the present study and 
the number of responses per task, respectively, were quite limited 
for testing measurement models. Therefore, future research should 
aim at replicating the findings based on greater samples, for 
instance, from the PIAAC main study. Another important replication
goal would be to investigate whether results on the time on
task effect are comparable across countries. 

In PIAAC the construct of problem solving in technology-rich 
environments was newly developed as was the measurement pro- 
cedure. Thus, future research will have to provide more informa- 
tion about this assessment’s validity and its predictive power. 
Moreover, the relation of problem solving in PIAAC to other 
problem solving measures and their theoretical underpinnings re- 
quires further clarification. There are several conceptual common- 
alities, for example, representing the difference between a current 
state and a goal state, defining a series of subgoals, and applying 
related nonroutine cognitive and behavioral operations to trans- 
form the given state into the targeted state, including progress 
monitoring. However, there are also remarkable differences. For 
instance, the construct of complex problem solving (cf. Funke & 
Frensch, 2007) assumes systems where complexity is defined by 
the number of elements and the relations among them. The prob- 
lem solving process is comprised of the acquisition of knowledge 
by means of exploration and the application of the obtained knowl- 
edge. Although acquiring knowledge or information is also a key 
aspect of problem solving as defined in PIAAC, acquired knowledge
in a complex problem solving task represents the explored
system of elements and relations itself. In contrast, in PIAAC 
problem solving, the explored system is just the medium carrying 
the information that is required to solve the task. However, an 
unfamiliar computer environment and unknown functionality 
would turn the problem solving in technology-rich environments 
task into a complex problem solving task (for technical problem 
solving, see, e.g., Baumert, Evans, & Geiser, 1998). Regarding our 
findings on problem solving as proposed by PIAAC, future re- 
search needs to show whether the pattern of results holds true also 
for other conceptions of problem solving that, for instance, are 
anchored in cognitive theory (see, e.g., Fischer, Greiff, & Funke, 
2012) or used in other large-scale assessments such as the Pro- 
gramme for International Student Assessment (PISA; cf. Greiff, 
Holt, & Funke, 2013). 


Educational Implications 


The present study frames the meaning of time in information- 
processing tasks by referring to models of skill acquisition and 
related individual differences. Therefore, although the analyses are 
based on assessment tasks, our results allow for some tentative 
conclusions on educational procedures in reading and problem 
solving instruction. Our results indicate that for learning and 
applying higher level cognitive skills, required component skills 
should be well routinized. If there is no established routine pro- 
cessing, for instance, when a poor reader encounters difficult 
reading tasks, information processing needs to rely on strategic 
processing as indicated by the reversed positive effect of time on 
task on task success. This means that for poor readers to be 
successful, they need to switch to compensatory behaviors, that is, 
reducing reading rate, looking back in text, reading aloud, and 
pausing, and/or compensatory strategies, that is, shifting attention
to lower level requirements and rereading text, to cope with their
deficits. Following Walczyk’s (2000) compensatory-encoding 
model, “With enough time, any text can be vanquished!” (p. 565). 

From an instructional point of view this means that becoming a 
good reader or problem solver requires the development of self- 
regulatory and metacognitive skills necessary to know when an 
effortful, controlled processing mode is to be employed. In the 
controlled processing mode, appropriate compensatory mecha- 
nisms can be initiated that have been learned and incorporated 
before. This might, for example, mean that in the face of reading 
comprehension difficulties, a part of a text is reread or that in 
problem solving, time is taken to focus attention on relevant 
subgoals. 

As the individual time on task effect is assumed to reflect the 
way of processing information, it may help to further describe the 
individual performance level and to identify instructional needs. 
As suggested in Figure 4 (bottom left panel), average readers show 
a great variation in the time on task effect, suggesting various 
levels of automaticity of component skills. Moreover, the in-depth 
investigation of temporal patterns in highly interactive tasks such 
as problem solving tasks can point to deficits in the information- 
processing strategy (cf. Zoanetti, 2010). For instance, if, in the 
“Job Search” task, log file data revealed that a problem solver
spent much time both on nontarget pages and target pages, this
pattern would suggest that the problem solver cannot process 
disconfirming information efficiently to quickly discard a nontar- 
get page. 

From an educational measurement perspective, the present study 
suggests that the meaning of time on task is not uniform. Thus, 
when collecting time information across tasks and individuals that 
are heterogeneous in difficulty and skill level, respectively, the 
role of time and its interpretation may differ. Regarding item 
response models including time as a regressor, van der Linden 
(2007) argued that time can only be interpreted uniformly as an 
indicator of speed if the tasks do not differ substantially in the 
amount of labor. In the present study where tasks differ consider- 
ably in the amount of information processing and problem solving, 
we take the different interpretations of time on task into account by 
letting its effect vary across tasks (random effect). 
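
The random-effect idea described above can be written out schematically. The following is only a sketch under assumed notation: the indices p (person) and i (task), the time variable t_{pi}, and the b terms are illustrative and are not taken from the model equations reported in this article.

```latex
\operatorname{logit} P(Y_{pi} = 1)
  = \beta_0 + b_{0p} + b_{0i} + (\beta_1 + b_{1i})\, t_{pi},
\qquad b_{1i} \sim N\!\left(0, \sigma^2_{b_1}\right)
```

Under this reading, a task with \beta_1 + b_{1i} > 0 shows a positive time on task effect and a task with \beta_1 + b_{1i} < 0 shows a negative one, so a single fixed slope \beta_1 is not forced onto all tasks.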

All in all, the analyses and results reported here illustrate the
potential that lies in exploiting time on task, or fractions of it, that
become available through computer-based assessments. They also make
clear, however, that any process measure must be cautiously
interpreted, at least by taking a closer look at the particular tasks
and their demands. Regarding the two constructs studied here,
reading literacy and problem solving in technology-rich environments,
our study indicates that they are quite different in terms of
cognitive processing. Skill, task difficulty, and time on task do 
interact in different ways. As Wirth and Klieme (2003) have 
shown based on student assessment in a German national exten- 
sion to PISA, problem solving tests, especially computer-based 
problem solving tests, add to the traditional set of literacy dimen- 
sions. In structural models, problem solving skills can be clearly 
distinguished from traditional abilities such as reasoning (cf. 
Wüstenberg, Greiff, & Funke, 2012). These structural analyses and
our in-depth analyses of processing time provide evidence that 
problem solving skills have to be separated from traditional edu- 
cational outcomes such as reading literacy. Problem solving is one 
of the most prominent examples of cross-curricular, nonroutine,
dynamic 21st century skills that are currently pursued as educational
goals and covered in large-scale surveys. Claims that these
new skills are different from traditional outcomes have mainly 
been supported by pragmatic or philosophical arguments. Now, we 
see that even in terms of cognitive processing and time allocation, 
there is a difference between reading literacy and problem solving 
skills.


The Role of Time on Task in Computer-Based Low-Stakes Assessment of
Cross-Curricular Skills


Sirkku Kupiainen, Mari-Pauliina Vainikainen, Jukka Marjanen, and Jarkko Hautamäki


University of Helsinki 


The role of time on task (TOT) for students’ attainment in a low-stakes assessment of cross-curricular 
skills was examined using the log data collected in the computer-based assessment (CBA). Two structural 
equation models were compared: Model 1, in which students’ test scores were explained by grade point 
average (GPA) together with mastery and detrimental motivational attitudes, and Model 2, in which TOT 
was added to the model to mediate the effects of GPA and the 2 motivational constructs. Fitting the 
models to nationally representative data of 4,249 Finnish 9th graders (M_age = 15.92 years) confirmed the
hypothesis that investment of time plays a key role in explaining test scores in low-stakes assessment 
even when prior ability (GPA) is taken into account. It was also confirmed that the effects of the 
detrimental attitudes on students’ attainment were mediated by TOT. The study makes an important 
contribution to research regarding the role of motivational attitudes and TOT in low-stakes assessment, 
which is vital for the use of the assessment results in national and international benchmarking. It is 
concluded that log data provide a functional way to investigate time investment in CBA as an indicator 
for students’ effort, yielding relevant implications for educational psychologists. 


Keywords: time on task, low-stakes assessment, log data, motivational attitudes, cross-curricular skills 


Ever since the 1990s, there has been a growing interest in the 
wider cognitive and affective goals of education. These cross- 
curricular skills are believed to indicate readiness for new learning 
and for successful adaptation to the rapidly changing demands of 
the future, and they collectively represent one of the reasons why 
large-scale low-stakes assessments are now at the forefront of the 
national and international education scene. The most prominent 
example of such a low-stakes assessment that regularly adminis- 
ters tests of cross-curricular skills is the Organisation for Eco- 
nomic Co-operation and Development’s Programme for Interna- 
tional Student Assessment. Also, many curricular assessments, 
such as the International Association for the Evaluation of Educa- 
tional Achievement’s Progress in International Reading Literacy 
Study and Trends in International Mathematics and Science Study, 
are presented to students as low-stakes assessments. The high 
policy visibility of these studies has led to the results being used 
for benchmarking at the international and the country levels (e.g., 
America Achieves, 2013; Grek, 2009). Yet, the low stakes have 
been shown to lead to reduced validity and reliability when stu- 
dents are not putting their best effort into the assessment because 
of the lack of personal consequences. They have also led to 
underestimated norms in later high-stakes tests on the basis of the 
results (Barry, Horst, Finney, Brown, & Kopp, 2010; Wise, 2006).
The problem of low stakes has been countered by adding to the
assessments self-report questions regarding the effort students
have invested in the assessment, but self-reported effort has been
shown to be unreliable because of untruthful responding and because
it does not account for change in effort across an entire
test session (cf. Wise & Kong, 2005). Another approach, available
in computer-based assessment (CBA), has been to use log data to 
look at the time students invest in assessment tasks (e.g., Schnipke 
& Scrams, 1997; Wise & Kong, 2005), and advances made in CBA 
during the past decades look promising in this respect (Greiff et al., 
2013; Wang, Jiao, Young, Brooks, & Olson, 2008). A key research 
question in CBA has centered on response time (RT) as an indi- 
cator of student effort, which is a crucial prerequisite for the 
reliability and validity of the assessment results (Lee & Chen, 
2011). 

When interpreted as a time-related indicator for effort, RT can 
be seen to relate to Carroll’s (1963) notion of time on task (TOT) 
in learning. In his model, Carroll considered learning to be deter- 
mined by the ratio of the time needed for learning and the time 
spent on learning (see also Bloom, 1980; Karweit, 1982). Ever 
since, TOT has featured regularly in meta-analyses of factors 
pertaining to learning and to school achievement (Hattie, 2005; 
Scheerens & Bosker, 1997). Yet, unlike the literature on RT, 
Carroll did not refer directly to motivational factors in his model 
but used the term engagement as a conative term, referring to the 
act of being engaged. 

Drawing on the two research traditions of TOT and RT and 
using the log data afforded by CBA, in this article, we investigate 
the role time plays in students’ attainment in a cross-curricular 
learning-to-learn (LTL) assessment. In the Finnish framework, 
LTL is conceptualized as the interplay of an individual’s cognitive 
competence and motivational and affective characteristics, aroused 
in a learning situation. Accordingly, LTL is assessed with a test
comprising cognitive tasks and a questionnaire with multiple
attitudinal scales (cf. Hautamäki et al., 2002, 2006). The present
article focuses on the role of TOT in the combination of students’ 
motivational attitudes as disclosed in the questionnaire and their 
attainment in the LTL test, taking into account prior ability. In this 
study, the term prior ability is used to refer to students’ cognitive 
competence as captured in school achievement (see Atkinson & 
Geiser, 2009; Cattell, 1987, pp. 139-145), indicated by their grade 
point average (GPA), whereas the term test attainment refers to
performance in the assessment (i.e., the test score). In this, the
study makes a fresh contribution to current literature regarding the
reliability of low-stakes cross-curricular assessments for making
psychologically grounded recommendations for changing educational
systems (Olson, 2003).

Carroll’s concept of TOT has been chosen to reflect the cross- 
curricular character of the assessment tasks. Unlike curricular 
assessment, the tasks require not only the application of the re- 
cently learned knowledge and procedures but the assimilation of 
the novel rules of the tasks and their application in the ensuing 
items. Accordingly, time has been operationalized as the time 
spent on a whole task instead of using item-specific RTs, bringing 
the study closer to Carroll’s TOT. Yet, by linking time use to 
motivational attitudes, the study relates to recent research on RT as 
an indicator for effort and, hence, also makes a contribution to 
literature in this field. 


TOT 


In his model, Carroll considers students’ aptitude for the task, 
their ability to understand instruction, and the quality of instruction 
as the main constituents of the time needed for learning (Carroll, 
1963, 1989, p. 26), whereas the time allocated (outer constraint) 
and the time a student is ready to spend (inner constraint) are the 
two constituents of the time spent on learning.¹ The time needed
and the time spent, together with the quality of instruction, deter- 
mine learning as the outcome. In her review, however, Karweit 
(1982) pointed out that much of the research on TOT concentrated 
on just the time spent on learning, ignoring the role of students’ 
ability to apply instruction as the component defining the time 
needed for learning. Hence, when the reviewed studies were reanalyzed
with students’ prior ability taken into account as a measure of the
time needed, the role of TOT for learning diminished to a third of
the effect reported in the studies that did not make this distinction
(Karweit, 1982; see also Gettinger, 1985; Karweit & Slavin, 1981).
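
Carroll's model is often summarized as a ratio. As a schematic sketch only (the function f and the term names are illustrative, and the inequality is one plausible reading of the outer and inner constraints described above rather than a formula given in this article):

```latex
\text{degree of learning}
  = f\!\left(\frac{\text{time spent}}{\text{time needed}}\right),
\qquad
\text{time spent} \le
  \min\bigl(\text{time allocated},\ \text{time the student is ready to spend}\bigr)
```

On this reading, two students given the same allocated time can differ in learning either because they need different amounts of time (prior ability) or because they are ready to spend different amounts of it (motivation), which is why ignoring the denominator inflates the apparent effect of time spent.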

Compared with earlier observation-based studies, the log data of 
CBA provide a relatively accurate measure of the time students 
spend engaged in learning or on assessment tasks. This allows
TOT to be related more rigorously not only to the results of the
assessment but also to other data collected concurrently.
Accordingly, whereas earlier research on TOT mainly concentrated
on just the interrelations between prior ability, TOT, and learning, 
in this article, we extend the focus to the role of students’ moti- 
vational attitudes in their readiness to exert themselves in the tasks. 
We bring a new point of view to TOT research, still considering 
the time needed and the time spent on learning but moving it closer 
to the interpretation of RT as an indicator for effort. In the present 
study, the object of prediction is students’ attainment in the low- 
stakes LTL reasoning tasks, and, on the basis of Carroll (1963, 
1989), it is hypothesized that TOT acts as a mediating factor for
the effect of both prior ability (in this case, earlier school
achievement as indicated by GPA) and students’ motivational attitudes as
determinants of the time students are ready to spend on the 
assessment tasks (inner constraint of time). 


RT 


Whereas TOT is generally used to refer to the time spent on 
learning in class or across school days, RT is used to indicate the 
time it takes a student to answer one specific item in an assessment 
(e.g., Lee & Chen, 2011; Schnipke & Scrams, 2002; Wise & Kong, 
2005). Much of the research on RT focuses on the comparison 
between high-stakes testing and low-stakes assessment, using RT 
as an indicator for effort. The expectation is that when test results 
have important personal consequences for the students (high 
stakes), they will put more effort into the test. When stakes are low 
at the personal level (low stakes), students are expected to balance 
test taking with other interests (e.g., trying to avoid mental exer- 
tion), leading to reduced effort. This is seen to affect the validity 
and the reliability of the results (Kong, Wise, & Bhola, 2007; Wise 
& DeMars, 2005; Wise & Kong, 2005). Accordingly, RT research 
provides empirical evidence on the relation of some motivational 
attitudes to students’ use of time in an assessment situation, pro- 
viding the link between RT and TOT elaborated on in the present 
study. 

Building on Schnipke and Scrams’s (1997) notion of solution 
versus rapid guessing behavior in speeded tests, Wise and Kong 
(2005) introduced the concept of response time effort (RTE) to
describe the proportion of test items for which an examinee
exhibits solution behavior rather than rapid guessing. Of interest for the
present study is that they found RTE was not related to academic
achievement as indicated by Scholastic Aptitude Test scores.
Additionally, Wise and DeMars found no relation, and no difference,
between students’ prior ability and their self-reported
motivation in high-stakes and low-stakes tests. Instead, they found a
significant difference in test attainment between motivated and 
unmotivated examinees, leading them to suggest the use of moti- 
vation filtering when striving for reliable proficiency estimates 
(Sundre & Wise, 2003; Wise & DeMars, 2008). 
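
As a minimal sketch of how such an index could be computed from log data, assume that each item has a time threshold separating rapid guessing from solution behavior. The function name, the threshold values, and the response times below are hypothetical; the threshold-setting procedure of Wise and Kong (2005) is not reproduced here.

```python
def response_time_effort(response_times, thresholds):
    """Proportion of items answered with solution behavior.

    An item counts as solution behavior when its response time is at
    least as long as that item's threshold (an assumed decision rule).
    """
    if len(response_times) != len(thresholds):
        raise ValueError("need one threshold per item")
    if not response_times:
        raise ValueError("need at least one item")
    solution_flags = [rt >= t for rt, t in zip(response_times, thresholds)]
    return sum(solution_flags) / len(solution_flags)


# Hypothetical log data for one examinee: seconds spent on four items.
example_times = [12.0, 1.5, 30.2, 2.0]
# Hypothetical thresholds: 3 seconds for every item.
example_thresholds = [3.0, 3.0, 3.0, 3.0]

rte = response_time_effort(example_times, example_thresholds)
print(rte)  # 0.5: two of the four items exceed the threshold
```

An examinee with a value near 1 responded with solution behavior on almost every item, whereas a value near 0 points to rapid guessing throughout, which is the kind of record that motivation filtering would remove.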

Alternatively, Goldhammer et al. (2014) showed in their study 
that the time students spend on successfully completing assess- 
ment tasks differs according to item difficulty (positive correla- 
tion), student ability (negative correlation), and domain. Unlike in 
reading literacy, in problem solving, the relation between TOT and 
task success was positive even for the easier tasks, indicating an 
additional difficulty inherent to tasks for which students do not 
have a preformed formula to use and for which effortful processing 
and, thus, a longer RT is needed. 

Research regarding RT supports the understanding that TOT 
acts as a mediating factor between students’ attainment in an 
assessment, their prior ability, and their motivational attitudes, 
even if neither TOT nor RT corresponds in their classical sense
fully to the way TOT has been operationalized in the present study.
Hence, it reflects the understanding of RT as (partially) an
indicator for effort and the results regarding the relative role of
cognitive competence and various motivational factors in explaining
achievement (e.g., Spinath, Spinath, Harlaar, & Plomin, 2006).


¹ Depending on what is considered to be a task, there may be alternative
definitions of TOT. For instance, in this special section, Goldhammer et al.
(2014) used the term time on task to refer to the time needed to respond to
a single question or problem, which is comparable to our notion of RT. We
use the term time on task to refer to the time needed to complete a whole
LTL reasoning task with instructions, examples, and multiple items to be
solved.


Present Study 


The study springs from a longstanding research project on LTL 
as one of the cross-curricular competencies the school is expected 
to foster in students (Csapó, 2007; Hautamäki et al., 2006; Hoskins
& Fredriksson, 2008). The developed instrument focuses on the
cognitive and affective factors salient for new learning and
accessible in school-based assessment (Hautamäki et al., 2002, p. 5).
The cognitive tasks are related to curricular content but do not
directly repeat it, so that they require higher level application of
general cognitive ability or reasoning in addition to the curricular
knowledge and skills learned at school (Hautamäki et al., 2002; see
also Adey, Csapó, Demetriou, Hautamäki, & Shayer, 2007, for the
influence of education on general cognitive ability). For the self-report
questionnaire, which comprises scales for diverse motivational and
affective constructs shown to bear on present and future learning, three
scales measuring attitudes positively related to learning and three 
scales measuring attitudes detrimental for learning were chosen. 

To address students’ investment of time in the assessment, we 
chose to use Carroll’s (1963) learning model to emphasize that in 
cross-curricular assessment students face the same constraints of 
time, motivation, and prior abilities as in all learning. However, the 
focus of the study is on students’ engagement and attainment in the 
assessment tasks instead of on their school achievement as in much 
of TOT research (Karweit, 1982). Therefore, we use students’ 
prior school achievement (GPA) as an indicator of ability for the 
next stage of learning (Atkinson & Geiser, 2009; Gustafsson & 
Carlstedt, 2006; Thorsen & Cliffordson, 2012) instead of seeing 
school achievement as just the end state (grades given in an 
end-of-year report). This reflects Snow’s (1990) model for educa- 
tional assessment, in which he posits the relations between rea- 
soning skills and school achievement as a cyclical transition from 
goal-relevant initial states to desired end states, which will, in turn, 
be the initial states for a next cycle of learning. 

To reflect the double focus of the study combining Carroll’s (1963, 
1989) concept of TOT in terms of time needed and time spent and 
research on RT in terms of the effect of attitudinal factors on time use 
set in the context of predicting students’ attainment in the LTL 
reasoning tasks, we posed four research questions: 


Research Question 1: How do students’ prior abilities as 
measured by school achievement (GPA) and their motiva- 
tional attitudes as disclosed in the self-report questionnaire 
implemented concurrently in the LTL assessment predict their 
attainment in the LTL reasoning tasks (LTL test score)? 


Research Question 2: How do the motivational attitudes pre- 
dict students’ TOT in the LTL reasoning tasks when school 
achievement (GPA) is taken into account? 


Research Question 3: How is TOT related to students’ attain- 
ment in the LTL reasoning tasks (LTL test score)? 


Research Question 4: How does TOT mediate the effect of the
motivational attitudes and of school achievement (GPA) on
students’ attainment in the LTL reasoning tasks (LTL test
score)?


On the basis of the reviewed literature, the following hypotheses 
were set: 


Hypothesis 1: School achievement (GPA) is a stronger predictor 
than motivational attitudes for the LTL test score, but both make 
independent contributions to it. More specifically, it is expected 
that higher GPA and stronger mastery attitudes predict a higher 
LTL test score, whereas detrimental attitudes have a negative 
effect on it (Klauer, 1988; Spinath et al., 2006). 


Hypothesis 2: School achievement (GPA) and mastery atti- 
tudes are positively related to TOT, whereas detrimental atti- 
tudes are negatively related to it (Carroll, 1963; Karweit & 
Slavin, 1981; Wise & DeMars, 2005). 


Hypothesis 3: TOT is positively related to the LTL test score 
(Carroll, 1963; Chang, Plake, & Ferdous, 2005; Goldhammer 
et al., 2014; Karweit, 1982; Kong et al., 2007). 


Hypothesis 4: TOT mediates the effect of school achievement 
(GPA) and motivational attitudes on the LTL test score (Car- 
roll, 1963; Karweit, 1982; Wise & DeMars, 2005; Wise & 
Kong, 2005). 
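
The mediation logic behind Hypothesis 4 can be illustrated with a deliberately simplified product-of-coefficients sketch. It is not the structural equation model used in this study: it uses ordinary least squares with a single predictor and a single mediator, and the variable names and toy numbers below are invented for illustration only.

```python
from statistics import mean


def cov(x, y):
    """Sample covariance of two equally long lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def simple_slope(x, y):
    """OLS slope of y regressed on x (with intercept)."""
    return cov(x, y) / cov(x, x)


def two_predictor_slopes(x, m, y):
    """OLS slopes of y regressed on x and m together (with intercept)."""
    sxx, smm, sxm = cov(x, x), cov(m, m), cov(x, m)
    sxy, smy = cov(x, y), cov(m, y)
    det = sxx * smm - sxm ** 2
    slope_x = (sxy * smm - smy * sxm) / det
    slope_m = (smy * sxx - sxy * sxm) / det
    return slope_x, slope_m


# Made-up toy values, NOT data from the study.
gpa = [6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 7.2]            # predictor
tot = [20.0, 24.0, 23.0, 28.0, 27.0, 31.0, 30.0, 26.0]    # mediator
score = [18.0, 22.0, 24.0, 27.0, 30.0, 33.0, 36.0, 25.0]  # outcome

a = simple_slope(gpa, tot)                          # path GPA -> TOT
direct, b = two_predictor_slopes(gpa, tot, score)   # GPA -> score, TOT -> score
indirect = a * b                                    # effect carried by TOT
total = simple_slope(gpa, score)                    # GPA -> score ignoring TOT

print(f"direct={direct:.3f} indirect={indirect:.3f} total={total:.3f}")
```

In this linear setting the total effect equals the direct effect plus the indirect effect, so a sizeable indirect term relative to the total is what "TOT mediates the effect" means in the simplest case. A latent-variable model with several attitude constructs, as used in the study, generalizes the same decomposition.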


Method 


Participants 


The data were drawn from a nationally representative sample of 
Finnish ninth grade students in spring 2012. Class-based Bernoulli 
sampling was used with all of the ninth grade classes from each 
sampled school included in the study. Overall, 8,875 students in 82 
schools participated in the assessment, but in the present study, 
only the data of the 4,249 students (2,153 boys, 2,050 girls, 46 not 
specified) assigned for CBA are included. The mean age of the 
students was 15.92 years (SD = 0.40, range = 14.67-18.57 years). 


Procedure 


The assessment was conducted by a teacher using written in- 
structions. Because of the limited number of computers in schools, 
the assessment was conducted one class at a time. The students 
were allocated 90 min for the assessment, a time that had proven 
sufficient in previous assessments (i.e., Carroll’s allocated time or 
outer constraint). The number of missing responses in the last task 
used in this study was only slightly larger than in the first, and only 
2% of the students who completed the first task did not complete 
the last, implying that the time allocated was a close match with 
the time needed. 


Measures 


The LTL reasoning tasks. The six reasoning tasks used in the 
study, taken from the Finnish LTL test, measure reasoning in 
different domains. The tasks and their reliabilities are presented in 
Table 1. The reliabilities were acceptable for all tasks. 
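
For readers unfamiliar with the reliability coefficient reported in Table 1, the following sketch shows the standard Cronbach's alpha formula applied to a students-by-items score matrix. The function name and the toy responses are illustrative assumptions and do not come from the study's data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per student,
    one column per item), using sample variances."""
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [
        variance([row[j] for row in scores]) for j in range(n_items)
    ]
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Made-up 0/1 item scores for five students on three items.
toy_scores = [
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 2))
```

Alpha rises when items covary strongly relative to their individual variances, so short scales with few items, such as the three-item attitude scales, need comparatively homogeneous items to reach the same value as a ten-item task.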

Table 1
Number of Items and Reliabilities (Cronbach’s α) of the
Cognitive Tasks and the Attitude Scales

Tasks and scales                      Items     α
Cognitive tasks
   Deductive Reasoning                    6    .60
   Missing Premises                      10    .61
   Relevance of Information              10    [?]
   Control of Variables                  10    .14
   Hidden Arithmetical Operators         10    .82
   Invented Mathematical Concepts        10    .66
Mastery attitudes
   Agency: Effort                         3    .19
   Mastery: Extrinsic                     3    .89
   Importance of School                   3    .83
Detrimental attitudes
   Means-Ends: Chance                     3    .19
   Means-Ends: Ability                    3    .67
   Self-Handicapping                      3    [?]


Three of the six tasks (Deductive Reasoning, Missing Premises,
and Analysis of Relevant and Irrelevant Information) were adapted
from the Ross Test of Higher Cognitive Processes (Ross & Ross,
1979). In the Deductive Reasoning task, students were presented with
six items, each with two premises and followed by three possible con- 
clusions. For each of these, the students had to decide whether they 
were true or false on the basis of the premises. The student 
received one point for an item if all three conclusions were 
answered correctly. The maximum score for the task was 6. 

In the Missing Premises task, students were presented with 10 
items with one premise and the conclusion given. The students 
were to choose the second premise from among five alternatives 
that would make the conclusion valid. Only one of the alternatives
was correct. The student received one point for each correct 
answer. The maximum score for the task was 10. 

In the Analysis of Relevant and Irrelevant Information task, 
which we refer to hereinafter as Relevance of Information, the 
students were presented with 10 arithmetic word problems that 
contained sufficient, insufficient, or extraneous information for 
solving the problem. The students were not asked to provide an 
answer to the arithmetic problems but to assess whether there was 
just enough, not enough, or even extraneous information given to 
solve them. The student received one point for each correct an- 
swer. The maximum score for the task was 10. 

The Control of Variables task is a modified version (Hautamäki, 
1984) of one of the science reasoning tasks of Shayer (1979), 
Pendulum, regarding the control of variables. It is based on one of 
the formal schemata identified by Inhelder and Piaget (1958). The 
students were presented with 10 items in the form of comparisons 
set in the world of Formula 1 races with four variables (driver, 
car, tires, and track) with two alternatives each. The students were 
to judge whether the single effect of the driver, car, tires, and track 
could be concluded from the comparison. There were seven com- 
parisons with three or four Yes–No choices for variables and three 
comparison sets to be complemented. The student received one 
point if all parts of the item were answered correctly. The maxi- 
mum score for the task was 10. 

The Hidden Arithmetical Operators task is based on the 
quantitative-relational arithmetic operators task of Demetriou and 
others (Demetriou, Pachaury, Metallidou, & Kazi, 1996; Demetriou, 
Platsidou, Efklides, Metallidou, & Shayer, 1991). The task 
comprised 10 items with one to four operators. For example, “(5 a 
3) b 4 = 6. In this task letter a/b stands for: addition (+) / 
subtraction (−) / multiplication (·) / division (÷)?” The student 
received one point if all operators in the item were answered 
correctly. The maximum score for the task was 10. 
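
To make the logic of such an item concrete, the example item can be solved by exhaustive search over the four operators. This is only an illustrative sketch: the function name and the brute-force approach are assumptions for demonstration and say nothing about how students were expected to reason.

```python
from itertools import product
import operator

# The four arithmetic operators a hidden letter may stand for.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def solve_two_operator_item(x, y, z, target):
    """Return every pair (a, b) for which (x a y) b z equals target."""
    solutions = []
    for a, b in product(OPS, repeat=2):
        try:
            value = OPS[b](OPS[a](x, y), z)
        except ZeroDivisionError:
            continue  # this combination is undefined, so it cannot be a solution
        if abs(value - target) < 1e-9:
            solutions.append((a, b))
    return solutions


# The example item from the text: (5 a 3) b 4 = 6.
example_solutions = solve_two_operator_item(5, 3, 4, 6)
print(example_solutions)  # [('-', '+')]
```

The search confirms that the example has a single solution: a is subtraction and b is addition, because (5 − 3) + 4 = 6.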

The Invented Mathematical Concepts task is a modified group 
version of Sternberg’s Triarchic Test (H version) Creative Number 
scale (Sternberg, Castejón, Prieto, Hautamäki, & Grigorenko, 
2001), in which arithmetical operators are conditionally defined 
depending on the value of the digits they combine (e.g., if a > b, 
lag stands for subtraction, else for multiplication). The task uses 
two operators with differing definitions and can comprise several 
operations in the same equation (“What is 4 lag 7 sev 10 lag 3?”). 
There were 10 items, each with four multiple-choice alternatives 
for the correct solution. The student received one point for each 
correct answer. The maximum score for the task was 10. 

Motivational scales. The six motivational scales used in the 
study are taken from the motivational–affective battery of the LTL 
test. They were presented to the students concurrently with the 
LTL reasoning tasks presented above. The chosen scales fall under 
two constructs: mastery attitudes and detrimental attitudes. The 
scales and their reliabilities are presented in Table 1. All affective 
items were answered on a 7-point Likert-type scale ranging from 
1 (not true at all) to 7 (very true). The reliabilities of all scales 
were acceptable. 

Mastery attitudes. The construct comprises scales from three 
subfields of motivational theory: achievement goal theory (e.g., 
Elliot & Dweck, 1988; Harackiewicz et al., 2002), agency beliefs 
(e.g., Chapman, Skinner, & Baltes, 1990), and the internalized 
value of education (cf. Ryan & Deci, 2000). From achievement 
goal theory, the construct mastery extrinsic orientation (e.g., “Get- 
ting good grades at school is important to me.”) was included, 
tapping into the internalized value of high attainment that Ryan 
and Deci (2000) called “identification” or “integration” in their 
taxonomy of human motivation. Of the trio of means–ends beliefs, 
agency beliefs, and control expectancies, the construct agency: 
effort (e.g., “I work hard to do well at school”) was included, 
because of its direct reference to learning as an activity. Regarding 
the internalized value of education, the construct importance of 
school was included, covering students’ views on the relevance of 
school and studying in general (e.g., “I think we learn useful and 
important things at school”). 

Detrimental attitudes. Of the attitudes detrimental to learning, 
two were derived from the background of means–ends beliefs: 
means–ends: chance (e.g., “Failure at school is mainly due to bad 
luck”) and means–ends: ability (e.g., “Poor grades are due to lack 
of ability”; e.g., Niemivirta, 2002). The third scale included in the 
detrimental attitude construct was for self-handicapping (e.g., “I 
give up easily if my assignments look too demanding”), also 
deriving from the broader field of achievement goal theory (e.g., 
Urdan & Midgley, 2001). 

GPA. To indicate prior ability, students’ GPA was calculated 
using their self-reported school grades in Finnish, mathematics, 
English, history, and chemistry in the midterm report card received 
4 months before the assessment. The scale of the Finnish school 
grades runs from 4 (failed) to 10 (excellent). 

TOT. The TOT was extracted from the CBA log files for each 
student for each task. The log file registered the time when the task 
was opened and the time when it was submitted as finished. Each 
task comprised an introduction to the task, one or more presolved 
example items, and six or more items for the student to solve. As 
all parts of a task were displayed on the same screen, TOT was 
available only on a whole task basis. The time was counted in 
seconds. 
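
The derivation of TOT from two logged time stamps can be sketched as follows. The log format is not described in the text, so the ISO 8601 time stamps and the function name below are assumptions made purely for illustration.

```python
from datetime import datetime


def time_on_task(opened, submitted):
    """Return whole-task TOT in seconds from two ISO 8601 time stamps.

    `opened` is when the task was opened and `submitted` is when it was
    submitted as finished (hypothetical log fields).
    """
    start = datetime.fromisoformat(opened)
    end = datetime.fromisoformat(submitted)
    return int((end - start).total_seconds())


# Hypothetical log entry: task opened at 09:00:00 and submitted at 09:05:21.
example_tot = time_on_task("2000-01-01T09:00:00", "2000-01-01T09:05:21")
```

Because only the opening and submission times are logged, the result covers reading the introduction, studying the example items, and solving the items together.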


Statistical Method 


In studying the effects presented in the research questions and 
the hypotheses, structural equation modeling (SEM) was used as it 
can incorporate all the effects into one simultaneous analysis. For 
structural equation modeling, Mplus 6.0 (Muthén & Muthén, 
1998–2010) was used. Because the variables were close to normally 
distributed (skewness and kurtosis between −1 and 1) 
except for the time variables before logarithmic transformations 
(described below), we used maximum-likelihood estimation. The 
criterion for an acceptable model fit was comparative fit index 
(CFI) > .900, Tucker–Lewis index (TLI) > .890, and root-mean-square 
error of approximation (RMSEA) < .08. For descriptive 
statistics and outlier definition, we used IBM SPSS Statistics 21. 
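
The fit criterion stated above amounts to a simple conjunction of three thresholds. The sketch below encodes it with a hypothetical helper function and applies it to the fit indices reported for Model 1 and Model 2 in the Results section.

```python
def acceptable_fit(cfi, tli, rmsea):
    """Apply the study's stated criterion: CFI > .900, TLI > .890, RMSEA < .08."""
    return cfi > 0.900 and tli > 0.890 and rmsea < 0.08


# Fit indices as reported in the Results section.
model_1_ok = acceptable_fit(cfi=0.932, tli=0.925, rmsea=0.051)
model_2_ok = acceptable_fit(cfi=0.901, tli=0.891, rmsea=0.056)
```

Both structural models pass, Model 2 only narrowly on CFI and TLI, which matches its description as a just acceptable fit.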

Hypotheses 1-3 were tested by studying the direct effects in the 
two specified SEM models: Model 1, in which students’ test scores 
were explained by GPA together with mastery and detrimental 
motivational attitudes, and Model 2, in which TOT was added to 
the model to mediate the effects of GPA and the two motivational 
constructs. 

For testing Hypothesis 4, indirect effects in Model 2 were 
studied, as, according to Zhao, Lynch, and Chen (2010), mediation 
can be equated with an indirect effect (see also MacKinnon, 
Lockwood, Hoffman, West, & Sheets, 2002). If the direct effect is 
then not significant, the mediation is full (Zhao et al., 2010). If the 
direct and the indirect effects are both significant and have the 
same sign, the mediation is partial (Zhao et al., 2010). In 
that case, the direct effect between the independent and dependent 
variables decreases after the mediator variable is added into the 
model (MacKinnon, Krull, & Lockwood, 2000). If the direct effect 
is positive and the indirect effect negative or vice versa, the 
mediation is competitive, which is also called suppression. Con- 
trary to partial mediation, the direct effect between the independent 
and dependent variables is strengthened after the suppressor vari- 
able is added to the model (MacKinnon et al., 2000). 
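
The decision rules in the preceding paragraph can be summarized as a small classification function. This is a sketch of the rules as stated here, not an implementation from Zhao et al. (2010); the function name is hypothetical, and the "no mediation" label for a nonsignificant indirect effect is an assumption added to make the function total.

```python
def classify_mediation(direct, indirect, direct_sig, indirect_sig):
    """Classify mediation from the sign and significance of two effects."""
    if not indirect_sig:
        return "no mediation"  # assumption: not spelled out in the text
    if not direct_sig:
        return "full mediation"
    if (direct > 0) == (indirect > 0):
        return "partial mediation"
    return "competitive mediation (suppression)"


# Same-sign, both significant: partial mediation.
same_sign = classify_mediation(0.57, 0.08, True, True)
# Opposite signs, both significant: suppression.
opposite_sign = classify_mediation(-0.20, 0.12, True, True)
```

The two calls use the Model 2 estimates reported later for GPA and for mastery attitudes, and they reproduce the interpretations given in the Results section.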

To test the significance of the indirect effects, 95% confidence 
intervals were computed and inspected. If the interval did not 
contain zero, the effect was considered significant. Because indi- 
rect effects are products of two (or more) effects, their standard 
errors and consequently their confidence intervals cannot be ob- 
tained in a straightforward manner. In Cheung and Lau’s (2008) 
simulation study, the standard errors and confidence intervals for 
indirect effects were most accurately estimated with a bias- 
corrected bootstrap method. In the present study, we used this 
method with 1,000 bootstrap replicates. 


Results 


Descriptives 


TOT. Because there is little empirical literature regarding the 
properties and the use of task-based time data in CBA, TOT was 
first studied independently from other measures. The distributions 
are presented in Figure 1. 

For determining possible outliers, graphical inspection (see 
Kong, Wise, & Bhola, 2007; Wise, 2006) was used as the skewed 
distributions do not allow for methods based on standard deviation 
(Cousineau & Chartier, 2010) and because outlier values would 
affect the standard deviations even after normalization of distributions 
through transformations (Leys, Ley, Klein, Bernard, & Licata, 
2013). On the basis of the graphical visualization of the distributions, 
in four of the six tasks, there were outliers whose TOT 
was considerably longer than that of the other test takers. This 
might be due to the student not submitting a task before opening a 
new one in a new window and only later returning to submit the 
original one. Cutting points at the higher end were defined separately 
for each task, on the basis of where the relatively even 
distribution ended: In Deductive Reasoning, three students were 
categorized as outliers with TOT ≥ 1,800 s; in Missing Premises, 
four students had TOT ≥ 2,446 s; in Relevance of Information, six 
students had TOT ≥ 1,710 s; and in Control of Variables, 13 had 
TOT ≥ 1,560 s. In Arithmetical Operations and Mathematical 
Concepts, no student could be categorized as an outlier, as the 
TOT was relatively evenly distributed up to 2,972 s and 2,880 s, 
respectively. Unlike in RT studies where rapid guessing is a 
concern because of its negative effect on reliability and validity 
(e.g., Kong et al., 2007; Wise, 2006), no limit for outliers at the 
lower end of the time distribution was defined. This reflects the 
focus of the study, with rapid submission seen as one form of a 
lack of motivational engagement in the assessment tasks.² 

The outliers’ TOT values were coded as missing and were 
excluded listwise from the descriptive analyses. The descriptive 
statistics for each time variable are presented in Table 2. 

The distributions of times were skewed, so the measures were 
transformed (IBM SPSS 21, LG10) into logarithmic scales. This 
brought the measures to the recommended limits for maximum- 
likelihood estimation (see Kline, 2005). 
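
The two preparation steps for the time variables, recoding upper-end outliers as missing and taking base-10 logarithms, can be sketched as follows. The study used IBM SPSS for this step; the Python function, its name, and the choice to treat values at or above a cutoff as outliers are assumptions for illustration. The cutoffs are the task-specific values given in the text.

```python
import math

# Task-specific upper cutoffs in seconds, as reported in the text.
# Tasks without a defined cutoff are simply absent from the mapping.
UPPER_CUTOFFS = {
    "Deductive Reasoning": 1800,
    "Missing Premises": 2446,
    "Relevance of Information": 1710,
    "Control of Variables": 1560,
}


def prepare_tot(task, seconds):
    """Return log10(TOT), or None when the value is missing or an outlier."""
    if seconds is None or seconds <= 0:
        return None
    cutoff = UPPER_CUTOFFS.get(task)
    if cutoff is not None and seconds >= cutoff:
        return None  # coded as missing, as described above
    return math.log10(seconds)


kept = prepare_tot("Deductive Reasoning", 100)        # within range
dropped = prepare_tot("Deductive Reasoning", 1800)    # at the cutoff
no_cutoff = prepare_tot("Hidden Arithmetical Operators", 2972)
```

No lower cutoff is applied, in line with the decision to treat very short times as meaningful low engagement rather than as outliers.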

LTL reasoning tasks and the motivational attitude 
constructs. The descriptive statistics for all but the time vari- 
ables are presented in Table 3. The unit of measurement in the 
reasoning tasks was the percentage of correctly solved items. For 
the items of the six motivational scales, the raw scores of the 
7-point Likert-type scale were used. The scale for the school 
grades is from 4 (failed) to 10 (excellent). All variables were 
almost normally distributed (skewness and kurtosis between −1 
and 1) and the means were close to those in earlier national studies 
(e.g., Kupiainen, Marjanen, Vainikainen, & Hautamäki, 2011). 


SEM 


To test the four hypotheses, two structural equation models were 
tested on the CBA sample of students. To address Hypothesis 1, in 
Model 1, attainment in the LTL reasoning tasks (test score) was 
predicted by prior ability (GPA) and by mastery and detrimental 
attitudes. To address Hypotheses 2–4 regarding the role of TOT, a 
TOT construct was added in Model 2 as a mediating factor. 


² We ran an auxiliary analysis, filtering out students with very short TOT 
by the standard deviation method after logarithmic transformation of the 
time variables as presented by Cousineau and Chartier (2010). The results 
did not change substantially from the full data, indicating that the inclusion 
of these students in the continuum of TOT is well founded. 

In the two models, the second-order latent factors of mastery 
and detrimental attitudes were calculated by first regressing the 
three-item scales on first-order factors of Mastery: Extrinsic, 
Agency: Effort, and Importance of School (mastery attitudes) and 
Means: Chance, Means: Ability, and Self-Handicapping (detrimental 
attitudes). The two second-order factors were then used to 
predict the latent factor of the LTL test score, which comprised the 
scores of the six LTL reasoning tasks. Self-reported school grades 
in five subjects (Finnish, mathematics, English, history, and chem- 
istry) were regressed on the first-order factor of GPA, also used as 
a predictor of LTL test score. The fit indices for all of these 
measurement models were at least acceptable (CFI = .957–.989, 
TLI = .920–.967, RMSEA = .063–.079) except for the chi-square 
statistic (136.718–499.037, df = 5–24, p < .001), which was 
expected to be significant because of the large sample size. The 
latent TOT variable was added to the second model as a mediating 
factor, comprising the logarithmic TOT variables for each task. To 
achieve a just acceptable model fit (CFI = .965, TLI = .895, 
RMSEA = .108) for this measurement model, we had to let two 
pairs of error terms correlate. 

First, Model 1 was fitted to the data (see Figure 2). The fit 
indices were acceptable (CFI = .932, TLI = .925, RMSEA = .051) 
except for the chi-square statistic (4,337.273, df = 365, p < .001). 
Because of the large sample size and the number of variables in the 
model, a significant chi-square statistic was expected. Even then, 
the chi-square value was quite large. For Model 1, the residual 
correlations were 0.05 on average. Nearly 34% (138/406) of the 
residuals had an absolute value above 0.05 and 11% exceeded 
0.10. Finally, 4% of the residuals, mostly for items measuring 
detrimental attitudes, had an absolute value of 0.15 or larger. The 
items were, however, kept in the analyses, as they were considered 
a necessary part in the theoretical framework of the current study. 


Table 2 
Descriptive Statistics of the Time on Task Variables Before Logarithmic Transformations 

Task                               N       Min   Max     M        SD 
Deductive Reasoning                4,222   4     1,370   320.65   153.84 
Missing Premises                   4,120   2     1,658   295.66   197.61 
Relevance of Information           4,211   3     1,287   254.22   133.67 
Control of Variables               4,138   3     [?]     215.74   110.14 
Hidden Arithmetical Operations     4,185   3     2,972   450.08   444.15 
Invented Mathematical Concepts     4,206   3     2,880   294.74   238.5[?] 

Note. The times were measured in seconds. N = number of students; Min = 
shortest time on task; Max = longest time on task. 


Table 3 
Descriptive Statistics of the Six Learning-to-Learn (LTL) 
Reasoning Tasks, Three Mastery Attitudes, Three Detrimental 
Attitudes, and Five School Grades in the Models 

Tasks and scales                     N       M         SD 
LTL reasoning tasks 
  Deductive Reasoning              4,243   49.90     27.88 
  Missing Premises                 4,105   41.4[?]   20.94 
  Control of Variables             4,148   33.47     [?] 
  Relevance of Information         4,231   36.98     13.81 
  Hidden Arithmetical Operators    4,200   45.98     24.09 
  Invented Mathematical Concepts   4,164   25.01     22.97 
Mastery attitudes 
  Agency: Effort                   4,230   4.61      [?] 
  Mastery: Extrinsic               4,230   [?]       1.23 
  Importance of School             4,249   4.52      1.18 
Detrimental attitudes 
  Means-Ends: Chance               4,230   [?]       [?] 
  Means-Ends: Ability              4,230   3.60      [?] 
  Self-Handicapping                4,230   3.94      [?] 
School grades 
  Finnish                          4,239   7.78      [?] 
  Mathematics                      4,239   [?]       [?] 
  English                          4,224   [?]       1.32 
  History                          4,234   7.79      1.30 
  Chemistry                        4,238   7.39      1.35 

Note. N = number of students. 

Hypothesis 1. We expected that prior school achievement 
(GPA) and mastery attitudes would predict the LTL test score 
positively, whereas detrimental attitudes would have a negative 
effect on it. Moreover, we expected GPA to be the strongest 
predictor of the LTL test score. GPA (calculated from self-reported 
school grades received 4 months prior to the data collection) was 
indeed the strongest predictor, and the relation between GPA and 
the LTL test score was strongly positive (β = .65, SE = .02). This 
result supported earlier literature regarding the role of both prior 
cognitive ability in explaining school achievement (e.g., Deary, 
Strand, Smith, & Fernandes, 2007; Rohde & Thompson, 2007) and 
GPA as an indicator for later cognitive performance (Atkinson & 
Geiser, 2009; Gustafsson & Carlstedt, 2006; Thorsen & Clifford- 
son, 2012). Furthermore, there was a positive correlation between 
GPA and mastery attitudes (r = .63, SE = .01), whereas detrimental 
attitudes correlated negatively with GPA (r = −.43, SE = 
.02) and with mastery attitudes (r = −.30, SE = .02). Detrimental 
attitudes also had a negative relation with the LTL test score 
(β = −.31, SE = .02). All correlations were statistically significant. 

Contrary to Hypothesis 1, the relation of mastery attitudes and 
the LTL test score was close to zero when students’ GPA and 
detrimental attitudes were controlled for. Despite their positive 
bivariate correlation (r = .43), the path coefficient between mastery 
attitudes and the LTL test score was negative (β = −0.08, 
SE = .02). 

Together, GPA and mastery and detrimental attitudes explained 
61% of the variance in the LTL test score. Thus, even without 
taking into account TOT, the model predicted the LTL test score 
fairly well. Hypothesis 1 was supported regarding the effects of 
GPA and detrimental attitudes but not for the expected positive 
effect of mastery attitudes on the test score after controlling for 
GPA. 

To address Hypotheses 2, 3, and 4, we fit Model 2, in which 
TOT was added to the set of constructs already used in Model 1, 
to the data (cf. Figure 3). It provided a just acceptable fit (CFI = 
.901, TLI = .891, RMSEA = .056; χ²(544) = 7,853.248, p < 
.001). Again, the absolute values of nonredundant residual correlations 
were .05 on average. About 36% of the residuals were 
above 0.05 in absolute value and 10% were larger than 0.10. 
Finally, 4% of the residuals exceeded 0.15 in absolute value. 
Again, almost all of the large residuals were associated with the 
items measuring detrimental attitudes, but they were nevertheless 
kept in the analysis for theoretical and validity reasons. 

Hypothesis 2. The second hypothesis was that higher GPA 
and stronger mastery attitudes would both make students invest 
more time in the assessment tasks, whereas a high level of detri- 
mental attitudes would make them invest less time. From Model 2, 
we can see that detrimental attitudes were a relatively strong 
negative predictor of TOT (β = −.41, SE = .02), whereas mastery 
attitudes and GPA predicted it positively (β = .21, SE = .02, and 
β = .15, SE = .02, respectively). We conclude that Hypothesis 2 
was supported. 

Hypothesis 3. The third hypothesis was that TOT would be 
positively related to the LTL test score. Figure 3 shows that adding 
TOT into the model increased significantly the share of the ex- 
plained variance in the LTL test score (81% vs. 61% of Model 1) 
and that the standardized path coefficient from TOT to LTL test 
score was relatively high (β = .57, SE = .02). Thus, Hypothesis 
3 was supported. 

Hypothesis 4. The fourth hypothesis was that TOT would 
mediate the effects of motivational attitudes and GPA on the LTL 
test score. For GPA, the standardized indirect effect on LTL test 
score was significant but weak (β = 0.08, with a 95% bootstrap 
confidence interval [0.06, 0.11]). The direct effect was strong even 
when TOT was taken into account (β = .65, SE = .02, in Model 
1; cf. β = .57, SE = .02, in Model 2). Therefore, the mediating 
role of TOT between GPA and LTL test score was, at best, partial. 

The indirect effect of mastery attitudes on the LTL test score 
was significant but relatively weak (β = 0.12 with a 95% bootstrap 
confidence interval [0.09, 0.15]). The direct effect was also significant 
(β = −.20, SE = .02) but negative. This, and the fact that 
in Model 1 the direct effect was weaker (β = −.08, SE = .02), 
indicates a suppression effect. In Model 2, the total effect of 
mastery attitudes on the LTL test score (0.12 + (−0.20) = −0.08) 
was equal to the direct effect in Model 1. Thus, adding TOT in the 
model did not alter the overall influence of mastery attitudes on the 
LTL test score. Moreover, even if mastery attitudes predicted TOT 
positively, this increase in effort did not seem to convert into a 
better achievement in the LTL test. 

For detrimental attitudes, the mediating role of TOT was clear: 
The standardized indirect effect of detrimental attitudes on the 
LTL test score was −0.23 (SE = 0.02) with a 95% bootstrap 
confidence interval [−0.26, −0.20]. The direct effect was significant 
but weak (β = −.07, SE = .02), whereas in Model 1 it was 
much stronger (β = −.31, SE = .02). Because the direct effect in 
Model 2 was practically nonexistent, although significant, this can 
(almost) be seen as a case of full mediation. We conclude that 
Hypothesis 4 was fully supported for detrimental attitudes and 
partially for GPA and mastery attitudes. 

Figure 2. Model 1: Predicting learning-to-learn (LTL) test scores by grade point average (GPA), mastery 
attitudes, and detrimental attitudes. All of the coefficients are statistically significant, p < .001. Numbers in 
parentheses after variable names indicate variance accounted for. Residual variances are not displayed. 
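
As a consistency check on the reported mediation results, each indirect effect should be close to the product of the path from the predictor to TOT and the path from TOT to the test score. The sketch below uses only the rounded standardized coefficients reported for Model 2, so agreement is expected only to within rounding error; the variable names are illustrative.

```python
# Standardized paths reported for Model 2 (rounded to two decimals).
paths_to_tot = {"GPA": 0.15, "mastery": 0.21, "detrimental": -0.41}
tot_to_score = 0.57

# Indirect effects as reported in the text.
reported_indirect = {"GPA": 0.08, "mastery": 0.12, "detrimental": -0.23}

# Indirect effect = (predictor -> TOT) * (TOT -> LTL test score).
indirect = {name: a * tot_to_score for name, a in paths_to_tot.items()}

# Largest gap between the product of rounded paths and the reported value.
max_gap = max(abs(indirect[k] - reported_indirect[k]) for k in indirect)

# Total effect of mastery attitudes in Model 2: indirect plus direct.
mastery_total = round(reported_indirect["mastery"] + (-0.20), 2)
```

The products are approximately 0.09, 0.12, and −0.23, each within 0.01 of the reported values, and the total effect of mastery attitudes is −0.08, matching the direct effect in Model 1.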


Discussion 


The objective of the study was to investigate the role of TOT in 
a computer-based low-stakes assessment of cross-curricular LTL 
reasoning tasks. Theoretically, the study relates to two traditions of 
time-related learning research: Carroll’s (1963) classical model on 
TOT in learning and the newer assessment-related research on RT 
(Schnipke & Scrams, 1997). This dual background reflects the 
double focus of the study on the impact of time use and on factors 
affecting time use. Accordingly, the findings of the study contrib- 
ute to both strands of research, and they are of special importance 
for the growing field of cross-curricular assessment and low-stakes 
testing. 

In this study, we first investigated the role students’ prior ability, 
as indicated by GPA, and their mastery and detrimental motiva- 
tional attitudes, as disclosed in a self-report questionnaire, had on 
their attainment in the cross-curricular LTL reasoning tasks. On 
the basis of earlier literature, we hypothesized that GPA and 
mastery attitudes would predict higher attainment (LTL test score), 
whereas detrimental attitudes would predict a lower LTL test 
score, and that the effect of GPA would be the strongest. After this, 
the role of the time students spent on the assessment tasks was 
explored, using the log data collected in the computer-based as- 
sessment. It was hypothesized that higher GPA and stronger mas- 
tery attitudes would make students invest more time in the assess- 
ment tasks, whereas a high level of detrimental attitudes would 
make them invest less time. Furthermore, it was expected that TOT 
would be positively related to the LTL test score and that TOT 
would mediate the effects of motivational attitudes and GPA on 
the LTL test score. 

To investigate the hypotheses, we specified and fitted two 
structural equation models to nationally representative data from 
4,249 Finnish ninth-grade students. In Model 1, to explore the first 
research question, the LTL test score was predicted by GPA and by 
mastery and detrimental motivational attitudes. The hypothesis 
regarding the role of GPA and of detrimental attitudes was supported, 
with higher GPA predicting a better LTL test score and 
detrimental attitudes a weaker LTL test score and the predictive 
power of GPA being far stronger. Contrary to the hypothesis, 
however, mastery attitudes had a very weak negative direct effect 
on the LTL test score despite the positive correlation between the 
two when GPA was not accounted for. This was interpreted to 
indicate that students’ mastery attitudes get fully rewarded in their 
GPA, indicating the central role of mastery attitudes in the process 
where students build their subject-specific achievement through 
the use of general cognitive ability (cf. Adey et al., 2007). 

Figure 3. Model 2: Predicting learning-to-learn (LTL) test scores by grade point average (GPA), mastery 
attitudes, and detrimental attitudes, taking into account the time the students invest in tasks. All of the 
coefficients are statistically significant, p < .001. Numbers in parentheses after variable names indicate variance 
accounted for. Residual variances are not displayed. 

In Model 2, TOT was added to Model 1 as a mediating factor. 
First, the relation of GPA and the two attitudinal constructs to TOT 
was studied. It was confirmed that GPA and mastery attitudes 
predicted TOT positively but their effect in explaining TOT was 
weaker than expected. Furthermore, as hypothesized, detrimental 
attitudes predicted TOT negatively and their effect in explaining 
TOT was relatively strong. 

The comparison of Model 1 and Model 2 confirmed that TOT 
plays a central role in explaining the LTL test score even when 
prior school achievement (GPA) is taken into account, increasing 
the explained variance in the LTL test score from 61% to 81%. It 
was also confirmed that TOT mediates the effects of GPA and 
detrimental attitudes on the LTL test score. For detrimental atti- 
tudes, the mediating effect of TOT was almost full, whereas for 
GPA, it was relatively weak. However, regarding Carroll’s notion 
of the relation of time needed and time used, this was to be 
expected. Adding TOT to the model did not change the total 
effect of mastery attitudes on the LTL test score. Moreover, even if 
mastery attitudes affected TOT positively, this increase in effort 
did not seem to convert into a better achievement in the LTL test. 

Overall, the results support the general finding of the strong 
positive relation between cognitive ability and school achievement 
(Deary et al., 2007; Rohde & Thompson, 2007), even if in the 
present study the direction of prediction was from prior school 
achievement to attainment in the assessment tasks. The results also 
support the understanding of the role of motivation and other 
affective factors on school achievement (Deci & Ryan, 2000; 
Harackiewicz et al., 2002; Little, Lopez, Oettingen, & Baltes, 
2001). 

By providing empirical evidence regarding the relations between 
motivational attitudes, school achievement, and cross-curricular 
LTL reasoning skills, the findings make an important 
contribution to the limited literature on the impact of affective 
factors on school achievement when controlling for students’ 
general cognitive ability (Gagné & St Père, 2002; Spinath et al., 
2006). Moreover, the reversed focus of the study to predict stu- 
dents’ attainment in the LTL reasoning tasks instead of school 
achievement sheds unexpected new light on the mutual relations of 
motivational attitudes, cognitive ability, and school achievement 
(cf. Demetriou, Spanoudis, & Mouyi, 2011). 

The confirmed relations between GPA, TOT, and students’ LTL 
test score, as well as the added explanative value of Model 2 
compared with Model 1 in regard to students’ attainment in the 
LTL reasoning tasks, provide empirical support to Carroll’s (1963) 
model on TOT. The evidence is of special value, as the object of 
prediction was students’ attainment in tasks requiring on-the-spot 
learning and application of novel rules, and the time spent was 
measured using the CBA log file, not commonly used earlier in 
research based on Carroll’s construct. 

Furthermore, the strong mediating role of TOT in the relation 
between detrimental attitudes and the LTL test score gives support 
to Wise and Kong’s (2005) interpretation of RT as an indicator for 
student effort. Yet, although they saw students’ use of solution 
versus rapid guessing behavior to be related to self-reported effort 
and to explain differences in test scores, they did not look for 
further explanations for these differences in RT effort. The find- 
ings of the present study help to answer this question. Regarding 
the current widespread use of low-stakes assessment in national 
and international benchmarking, the finding of the strong negative 
impact of detrimental attitudes on students’ TOT and consequently 
on their attainment is of prime importance. In this, the study makes 
an important extension to Wise and his colleagues’ (e.g., Wise, 
2009; Wise & DeMars, 2005, 2008; Wise & Kong, 2005) research 
regarding the threat low effort presents to the validity and reliabil- 
ity of assessment data. 

The lack of direct effect of mastery attitudes on the LTL test 
score despite its strong relation with GPA underlines the role of 
mastery attitudes in aligning students’ use of cognitive ability with 
the goals of the school, rewarded in better grades. This can be seen 
as just one phase of the continuous cycle of the further develop- 
ment of cognitive ability through engagement in subject-specific 
learning (cf. Adey et al., 2007; Gustafsson & Carlstedt, 2006). In 
this, the results support the claim of the Finnish LTL framework of 
these cross-curricular 21st century skills being fostered through 
subject-specific teaching and requirements, and they provide one 
answer to Demetriou et al.’s (2011) call for the education of early 
adolescents to focus on the development of thinking and problem- 
solving skills. 

The study was built on Carroll’s (1963) concept of TOT, but 
through its context of assessment and the use of CBA log data, it 
also related to recent research on RT. One way forward on this 
path of combining the two would be to study students’ time 
investment by separating the time students need to read and 
assimilate the task instruction from the time they use to solve each 
item (RT; Chang et al., 2005). Later, this could be used to develop 
monitoring systems for computer-based learning, allowing teach- 
ers to better follow individual students’ pace and quality of learn- 
ing. This would be of special help in supporting struggling students 
and could be used in the development of adaptive learning and 
assessment programs. It would also allow the teacher to better 
monitor students’ TOT behavior and the relation of their motivational 
attitudes to learning. 

The present study confirmed that TOT mediates the effects of 
detrimental motivational attitudes on test attainment. A next step 
could be to test the models specified in this study with samples of 
younger students to see whether TOT would provide a tool for 
disclosing the effect of their more immature self-awareness on the 
modeling of the development of the relations of cognitive compe- 
tence, motivational attitudes, and school achievement (cf. Dem- 
etriou & Kazi, 2006; Demetriou et al., 2011; Harter, 1999; for a 
discussion of response bias, see Bachman & O’Malley, 1984; 
Buckley, 2009). 

The juxtaposition of the two models revealed the role TOT plays 
in success and its relation to other factors relevant to performance 
and new learning. The findings point out the advantage CBA log 
files offer in addressing time investment as a factor crucial for all 
learning but also for reliable assessment and benchmarking. 

Computer-Based Assessment of School Readiness and Early Reasoning


Benő Csapó, Gyöngyvér Molnár, and József Nagy
University of Szeged 


This study explores the potential of using online tests for the assessment of school readiness and for 
monitoring early reasoning. Four tests of a face-to-face-administered school readiness test battery (speech 
sound discrimination, relational reasoning, counting and basic numeracy, and deductive reasoning) and a 
paper-and-pencil inductive reasoning test were transferred to an online platform and administered at the 
beginning of school to samples of first-grade children (the sample sizes were between 364 and 435). Results 
of the original and the computerized tests were analyzed to explore (a) whether the new scales were identical 
to the original ones; (b) how the change of media influenced the reliability of the tests; and (c) whether the 
migration into a new medium affected gender differences. Analyses indicated that measurement invariance 
held in a strict sense in the case of the inductive reasoning test (the migration did not change the general look 
of the test or the item types) and only partially for the speech sound discrimination test (neither the item type 
nor the scoring principle was changed). Measurement invariance did not hold for the 3 remaining tests. In 3 
tests—speech sound discrimination, relational reasoning, and deductive reasoning—the online versions 
demonstrated improved reliability. Only certain items of the numeracy test could be assessed on computer, and 
the reliability of the shortened test decreased. No differences were found between the 2 versions of the 
inductive reasoning test. Gender differences were explored for the speech sound discrimination test, and latent 
analyses indicated that measurement invariance did not hold. Girls’ performance was somewhat better,
as in former face-to-face assessments, in which girls performed slightly better than boys. These results
encourage further research on the extension of computer-based assessment to early childhood education. 


Keywords: computer-based assessment, online testing, school readiness, inductive reasoning, early
childhood assessment


A large number of studies have highlighted the importance of 
smooth preschool-to-school transition and the successful first years of 
schooling from different perspectives. Research has paid increasing 
attention to identifying the conditions of a successful start in schooling.
Among these efforts, creating instruments for assessing school
readiness and monitoring development at the beginning of schooling
plays an important role. A broad range of instruments, including
observation protocols, tests, and test batteries, are available, which can 
be used to assess different aspects of general cognitive development 
as well as specific precursors of skills learners are expected to master 
at school. However, many instruments that have been proven valid 
and reliable under research or pilot conditions turn out to be too 
complicated to use regularly in schools. Sometimes they are not 
sufficiently precise if not used under standardized conditions or if not 
administered by specially trained teachers. In many cases, the time 
and human resources required to administer and score the tests pre- 
vent their frequent use. Technology-based assessment may solve 
these problems, but administering computerized tests to young children
before or at the initial stage of formal schooling may raise a
number of questions concerning the validity of results obtained 
through technology-based assessment in young children. 

In this article, we explore the possibilities of online testing at the 
beginning of formal education by comparing traditional and digitized 
versions of five tests. Four of them are tests from the DIFER (DIagnosztikus
FEjlődésvizsgáló Rendszer; diagnostic system for assessing
development) school readiness test battery, an instrument with a long
developmental history (Nagy, 1980, 1987; Nagy, Józsa, Vidákovich,
& Fazekasné Fenyvesi, 2004a, 2004b). The fifth is an inductive
reasoning test prepared to measure learners’ general mental ability. 
These five instruments measure different psychological attributes, and 
their computerized versions require different technological solutions. 
This variety of instruments offers a number of possibilities to analyze 
the prospects for and limitations of technology-based assessment 
around the time of the kindergarten-school transition. As we focus on 
the applicability of technology, we deal only in brief with the general 
functions of school readiness tests and other instruments used for 
monitoring children’s development during the first school years. 


Assessment of School Readiness and 
Early Development 


Mastery of basic literacy and numeracy skills is the main goal of 
the first school years; therefore, school readiness tests are often 
composed of tasks that measure precursors to speaking skills, 
vocabulary, early reading, writing, counting, computing, reasoning 
(comprehending relations and inferential processes), and the elements
of behavior and social skills (attention, following instructions,
and collaborating) that are necessary for working in classroom
settings (Konold & Pianta, 2005). Longitudinal research
indicates that early (preschool as well as first grade) mathematical 
and reading skills represent strong predictors of later achievement 
(Duncan et al., 2007; Hair, Halle, Terry-Humen, Lavelle, & 
Calkins, 2006; Magnuson, Ruhm, & Waldfogel, 2007; Merrell & 
Tymms, 2010). For example, Tymms, Jones, Albone, and Henderson
(2009) reported correlations ranging from .65 to .80 between
kindergarten assessments and mathematics and reading achieve- 
ments measured in the first and fifth grades. 

A number of instruments have been developed to monitor early 
development (see C. E. Snow & Van Hemel, 2008), but only a few 
of them are used in regular educational practice due to theoretical 
and practical constraints. Among the theoretical problems often 
cited are the difficulties of defining the concept of school readiness 
and properly determining the purpose of testing. For instance, 
readiness may mean either readiness to learn or readiness to 
perform in a school setting (Carlton & Winsler, 1999). If readiness 
testing is focused on predicting school performance, then it will be 
the slow developers or low-performing children who are most in 
need of the developmental influences provided by school that are 
prevented from entering it (Shepard, 1997). To overcome these 
difficulties, a more complex conception of school readiness is 
proposed, a concept that also takes into account children’s cogni- 
tive, emotional, and social development (Blair, 2002). These issues 
are less crucial if school readiness tests are used as diagnostic tools 
and identification of deficiencies is followed by treatment. 

Early tests must take into account that the children assessed may 
not be able to read. Thus, these tests are usually individually 
administered with stimuli presented and instructions read by test 
administrators, who also record the answers. This limits the stan- 
dardization of testing conditions and leaves the process open to 
subjective influences and interpretations of test takers’ responses. 
Research on the quality of school readiness testing indicates that 
assessments made by teachers are often biased (as they are less 
strict with the children) when their conclusions are compared with 
results from objective assessment instruments (Mashburn & 
Henry, 2004). Despite these constraints, a number of school read- 
iness assessments are based on the direct observation of children 
(e.g., the Early Development Instrument; see Guhn, Janus, & 
Hertzman, 2007). The Performance Indicators in Primary Schools 
(PIPS) tests are used to monitor children’s development in the 
early years of primary school. Its Baseline Assessment (PIPS 
BLA) measures early reading; mathematics; phonological aware- 
ness; and personal, social, and emotional development on a 5-point 
scale (Merrell & Bailey, 2012). This instrument was used in a 
large-scale international study (iPIPS) to compare children’s early 
development in English-speaking countries. 

Although the majority of studies on school readiness assessment 
have focused on the cognitive domain, recent research identified 
several further factors, which play a crucial role in kindergarten- 
school transition and later development, such as self-concept, peer 
status, classroom contexts, and parenting (Bossaert, Doumen, 
Buyse, & Verschueren, 2011; McWayne, Cheung, Wright, & 
Hahs-Vaughn, 2012). Although there are still a number of open 
questions related to certain details of the content of school readiness
assessment and the ways their data may be used, there is a
consensus that the availability of appropriate and easy-to-use measurement
instruments is crucial to helping children begin school
successfully and to identifying those who are in need of additional
support (K. L. Snow, 2006).


Computer-Based Assessments and the Context 
of the Present Study 


In educational practice, there may be different forces and inter- 
ests driving the search for better solutions and applications of 
technology to replace traditional (face-to-face and paper-and- 
pencil) forms of assessment. The main factors motivating the use 
of technology are improving the assessment of already established 
assessment domains (Csapó, Ainley, Bennett, Latour, & Law,
2012) and measuring constructs that would be impossible or difficult
to measure without the means of technology (e.g., Complex
Problem Solving; see Greiff, Wüstenberg, & Funke, 2012; Greiff,
Wüstenberg, Holt, Goldhammer, & Funke, 2013; Greiff et al.,
2013).

Computer-based (CB) and paper-and-pencil (PP) test compara- 
bility studies were among the most extensively researched ques- 
tions over the last two decades. Because of several advantages, CB 
assessment delivery has been gradually replacing traditional PP 
delivery as it permits the tailoring of tests to the individual char- 
acteristics of learners (e.g., adaptive testing), automated scoring 
(including promising developments in children’s speech recogni- 
tion) and immediate feedback, the inclusion of innovative item 
formats (e.g., multimedia elements, simulation, and dynamic 
items), precise control over the presentation of test stimuli, and 
reduced costs of test administration (see Price et al., 2009). One of 
the regular large-scale assessment programs, the Programme for 
International Student Assessment (PISA), is also gradually shifting 
from PP to CB assessments. In PISA (2006, 2009, 2012; see 
Organisation for Economic Co-Operation and Development, 2010, 
2011, 2014), CB assessments were offered as international options 
or took place in one of the innovative domains; in 2015, the major 
domains (reading, mathematics, and science) will be assessed with 
computerized instruments. 

A number of studies have been conducted in different knowl- 
edge and competence domains using a variety of educational tests 
to examine whether test delivery mode affects children’s perfor- 
mance (Clariana & Wallace, 2002; Kingston, 2009; Wang, Jiao, 
Young, Brooks, & Olson, 2008). The differences between PP and 
CB test performance in terms of validity and reliability, advantages 
and disadvantages, and the effects of background variables (gender,
race/ethnicity, and technology-related factors, such as computer
familiarity; Csapó, Molnár, & Tóth, 2009; Gallagher,
Bridgeman, & Cahalan, 2000) have been widely studied and well 
documented. Most of the recent media effect studies have indi- 
cated that PP and CB testing are comparable and that students 
prefer CB tests to traditional PP testing. Although the research 
results are inconsistent to some extent, comparability problems are 
likely to decrease over time as computers become more broadly 
accessible at schools (Way, Davis, & Fitzpatrick, 2006). Even 
though there is a lively debate over comparative studies, less 
attention has been paid to the effects of different delivery modes 
on different subgroups of the samples, and only a few studies have 
focused on testing very young learners in a technology-based 
environment (Carson, Gillon, & Boustead, 2011; Choi & Tinkler, 
2002). 

The most widely studied subgroup differences are those between
girls and boys. Gender differences are routinely analyzed in
large-scale assessment programs and are especially relevant in CB 
testing. The results of previous studies have revealed that the new 
media slightly changed the pattern of differences compared with 
the traditional PP assessments. In large-scale international PP 
assessments, the overall pattern is that boys and girls perform alike 
or boys do slightly better than girls in mathematics and science, 
whereas girls perform better than boys in reading. Boys usually 
perform better on the information-communication technology 
(ICT) literacy and computer familiarity test, and their better ICT 
skills may affect the results of CB tests. For instance, in the PISA 
(2006) study, science was tested in three countries using computers 
as well, and gender differences varied across the three participating 
countries on the Computer-Based Assessment of Science. How- 
ever, boys outperformed girls on average (Organisation for Eco- 
nomic Co-Operation and Development [OECD], 2010). In the 
PISA (2009) survey, electronic reading was an innovative assess- 
ment domain. On the Electronic Reading Assessment, girls out- 
performed boys on average, and this pattern was the same across 
all OECD countries (OECD, 2011). Horne (2007) reported similar 
results in reading and spelling tests. In general, no gender differ- 
ences were found on the computerized versions, whereas girls 
outperformed boys on the paper versions of the tests. A study 
compared the achievement of fifth-grade (11-year-old) primary- 
school children in inductive reasoning measured by PP and CB in 
a larger representative sample in Hungary and indicated no 
achievement differences between boys and girls in PP or CB test 
results (Csapó et al., 2009). In the context of early testing, analyzing
gender differences may be essential for making existing
instruments equally usable for boys and girls. 

One of the main tasks of current developmental efforts is to 
migrate well-established face-to-face or PP tests to the new tech- 
nology. However, whereas a switch to CB delivery is accompanied 
by some obvious improvements in efficiency, cost-effectiveness, 
and precision, further research is required to determine potential 
changes in reliability, ecological validity, applicability, and possi- 
ble biases when migrating testing to the new medium. Most 
previous mode effect studies compared PP and CB delivery modes 
only. In the present study, we explore the differences between 
individual face-to-face and online testing as well. 


The Development of the DIFER Test Battery and the 
Inductive Reasoning Test 


The development of the school readiness test battery, which is at 
the center of this study, started back in the 1970s, when the first 
large-scale assessment of young learners in Hungary explored a 
group of skills necessary for a successful start of schooling. The 
results of this work (Nagy, 1980) formed the foundations for 
developing an extensive instrument, the PREFER test battery, 
administered face to face (FF) by teachers and covering the most 
essential competencies needed to begin school successfully (Nagy, 
1987). After using it in educational practice for more than a 
decade, it was revised and renewed as the DIFER test battery 
(Nagy et al., 2004b). It has been used in several large-scale 
assessments to establish its reliability and predictive validity 
(Nagy et al., 2004a). Five DIFER tests were used in a longitudinal 
program, where they were the first instruments administered to a
sample followed for 10 years (Csapó, 2007; Józsa, 2004). Strong
correlations were found between the DIFER test and later school
achievement. For example, the results of the counting and basic
numeracy DIFER test correlated at .60 with a counting test administered
at the end of second grade, and the correlation remained
.49 (with a mathematical reasoning test) at the end of fifth grade
and .48 at the end of eighth grade, with the mathematics test
administered within the framework of the National Assessment of
Basic Competencies (Csapó, 2013).

In educational practice, the DIFER can be used as a diagnostic 
instrument. Children are assessed regularly over time, and a record 
of their development is kept in a booklet. The development of 
those who lag behind may be stimulated by special purpose exer- 
cises. The DIFER is designed so that its administration does not 
require specific expertise; it can be administered by kindergarten 
and primary-school teachers. A major drawback of the test battery 
is that it must be administered face to face and individually. This 
is especially problematic for primary schools, where it is difficult 
to fit testing sessions into teachers’ and learners’ schedules. An- 
other issue is the objectivity of the test administration as teachers 
may read the instructions to children in slightly different ways, and 
the scoring of the responses may also vary. An online delivery of 
prerecorded voice instructions (with texts read by trained speakers) 
and automated scoring may solve these problems. As two out of 
the seven DIFER tests (social skills and writing) cannot be imme- 
diately digitized, the remaining five (speech sound discrimination, 
relational reasoning, deductive reasoning, inferential reasoning, 
and counting skills) were transferred to an online platform in this 
study. The characteristics of inferential reasoning do not differ 
much from those of deductive reasoning; thus, we omitted infer- 
ential reasoning from the analyses presented in this article. 

To increase the variety of the instruments, a PP inductive 
reasoning test and its digitized version were added to the four 
remaining DIFER instruments (speech sound discrimination, rela- 
tional reasoning, counting and basic numeracy, deductive reason- 
ing). The development of the inductive reasoning tests began in the 
early 1990s (Csapó, 1997). Several PP inductive reasoning tests
have been in regular use for almost two decades both for
mapping the development of inductive reasoning itself (Csapó,
2007) and for measuring inductive reasoning performance as an
indicator of the developmental level of higher order thinking
skills (Csapó & Nikolov, 2009). Later on, a second inductive
reasoning test was constructed for the early grades, based on
Klauer’s model of inductive reasoning (Klauer, 1989), and has
been used in experiments to assess the effect of training (Molnár,
2011). Using item response theory (IRT) analyses, both
tests were equated so that their results were represented on the
same scale (common-person methods were used; Molnár &
Csapó, 2011). Finally, computerized versions were created for
both tests, and the effects of delivery mode were studied by
comparing the PP and CB versions (Csapó et al., 2009).


Research Questions 


In this study, we explore the possibilities of the application of 
online assessment in regular educational practice at the beginning 
of schooling. For this purpose, we apply FF and PP tests that 
already have established psychometric characteristics; we transfer 
them to the new media and compare them by answering three
research questions. 

1. Does the medium of delivery influence the results, or can the 
results of the tests be represented on the same scale (i.e., testing of
measurement invariance)? 

2. If the tests differ between the two media, what influences 
these differences (i.e., psychometric properties)? 

3. Does changing the mode of administration affect gender 
differences (i.e., latent mean differences between boys and girls)? 


Method 


A number of constraints have to be taken into account when 
carrying out comparative assessments with children entering 
school using an emerging technology. 

1. Although online technology has had a relatively short devel- 
opmental history, it has been extensively piloted with schoolchil- 
dren of different age groups, but not yet with preschool children. 

2. To ensure comparability and to prevent the impact of school- 
ing, a very short period is available for testing; schoolchildren may 
only be assessed at the very beginning of the first school year. (All 
assessments reported here took place during the first weeks of the 
first school year.) 

3. As ecological validity is a main concern of the research, all 
assessment occurred in real school settings using the available 
infrastructure. 

These conditions were equally taken into account when the 
study was designed, data sources selected, and procedures planned. 


Participants 


Data for two DIFER tests (relational reasoning test and counting 
and basic numeracy test) were drawn from an assessment in which 
all Hungarian children of school-entering age were assessed with 
the DIFER tests. A subsample was randomly selected for further 
detailed analyses; we used these data in this study. Two further 
DIFER tests (speech sound discrimination and deductive reason- 
ing) were administered to different samples representatively drawn 
from the school-entering population. The PP inductive reasoning 
test was administered to a further representative sample. In each 
case, school classes formed the units of selection. The sample sizes 
and the attributes of the tests administered to the samples are 
summarized in Table 1. 

The digitized versions of all five tests were administered to 
different samples due to organizational issues. These samples were 
randomly drawn from first-grade children in Hungarian primary
schools. The online version of the speech sound discrimination test
was administered to the same sample as its FF version. In this case, 
the order of modes was randomized, and there was a 2-week 
interval between the two testing sessions. 


Instruments 


The study is based on five tests that measure different skills 
essential for later learning. These key skills include (a) speech 
sound discrimination, a prerequisite of successful reading; (b) the 
ability to understand the meaning of words that denote relations; 
(c) number concept and basic counting skills; and basic (d) deduc- 
tive and (e) inductive reasoning skills, all of which are prerequi- 
sites to learning to read and to studying mathematics and science. 

An FF or PP version existed for each test, as described in the 
theoretical part of the present article. For the present study, we 
constructed electronic versions of the existing instruments, basi- 
cally by migrating the items to the new platform. The viability and 
success of the migration depended on the content of the assessment 
and the item type. In the process of test digitization, one of the 
central aims was to preserve as many features of the items as 
possible in order to make the two delivery modes comparable. The 
paper and screen layouts were identical or as similar as possible. 

Two out of the seven DIFER tests could not be implemented in 
the new medium. The FF social skills test is based on an obser- 
vation of the children’s behavior. This test proved to have high 
predictive validity in a longitudinal study, but it could not be 
realized on a computer. The writing test examined fine hand 
movement (fine motor skills), which is a precondition of learning 
handwriting. It was not possible to implement this with the avail- 
able technology. The other FF DIFER tests were converted into 
CB formats, although some items had to be omitted and the 
open-ended items were reformulated and converted into multiple- 
choice items to allow automated scoring. Only items implemented 
in both media were used in the comparative analyses. 

Speech sound discrimination test. This test includes 60 
items that measure the perception of phonemic contrasts. The test 
reveals whether children have good hearing and are able to differ- 
entiate some critical pairs of phonemes, for instance /v/ - /f/ and /b/ - 
/p/ (e.g., in the pairs of Hungarian words vonal-fonal and bont-pont). 

In the first part of the original FF version of the test, the 
administrator read two sentences; each one contained one of the 
words from the pairs and showed the matching picture depicting 
the object referred to in the sentence. Then the administrator read 
only one of the words. The children indicated their answer by 
pointing to the picture that matched the word the administrator had 


Table 1
The Samples in the Study and the Attributes of the Tests (Number of Items, Reliability)

Test                                  N (FF or PP)   N (CB)        Number of items   Cronbach’s α (FF or PP)   Cronbach’s α (CB)
Speech sound discrimination^a (FF)    364            364           60                .887                      .938
Relational reasoning (FF)             1,892          426           24                .796                      .844
Counting and basic numeracy (FF)      1,895          435           13                .812                      .770
Deductive reasoning (FF)              424            402           32                .743                      .831
Inductive reasoning (PP)              952            [illegible]   37                .855                      .856

Note. FF = face-to-face; PP = paper and pencil; CB = computer-based.
^a The FF and CB tests were administered to the same sample.
read. Finally, the test administrator scored and logged the answers
on a scoring sheet. In CB mode, the same pictures were presented 
on the screen, and instructions were given online by a prerecorded 
voice. Children used headsets and heard the same voice of a 
trained speaker. They had to indicate their answer by using the 
mouse and clicking on the correct picture. An analogous English- 
language example could be the following: “This is a sheep (show- 
ing a picture of a sheep). This is a ship (showing a picture of a 
ship). Now I will only say one word. Point/click at the picture, 
which depicts it.” 

The second part of the test focused on children’s phoneme 
perception in fluent speech and on the correct pronunciation of a 
word depicted by a picture. Finally, the third and fourth subtests 
contained pairs of real or pseudowords or words that differed in 
one phoneme. Test takers had to decide whether the two words in 
each pair matched or not. 

Relational reasoning test. Understanding words that denote 
relations between different objects, attributes, or processes is a 
precondition of school learning. The DIFER contains four equiv- 
alent versions of relational reasoning tests both in FF and in CB 
mode, each containing 24 items. As their structure was identical, 
we use only one version in this analysis. In each test, there were 
eight relation words tied to space (e.g., inside, between), four 
relation words encoding quantity (e.g., odd, few), four relation 
words referring to actions (e.g., step on, step in), four relation 
words related to time (e.g., earlier, later), and, finally, four dif- 
ferent relational expressions encoding physical measures (e.g., 
“the shortest,” “the same length”). Figure 1 shows a sample item 
from the CB test; the picture was the same in FF mode as well. 

With FF administration, the instructions were given and the test 
administrator scored the answers. Children had to supply their 
answers by pointing to the matching picture(s). In the CB envi- 
ronment, instructions were given online; students had to provide 
their answers by using the mouse and clicking on the matching 
picture(s). 

Counting and basic numeracy skills test. The original test 
constructed for FF administration consisted of items that measure 
the understanding of the meaning of numbers, number relations, 
Figure 1. Sample item from the computer-based version of the relational
reasoning test. [In this picture, you can see a house and four birds. Click on 
the bird that is higher up than the others.] 


and basic mathematical thinking. Some items were based on oral 
counting, and, as the online platform is not yet able to handle oral 
responses, a number of items on the original numeracy test were 
omitted from the CB version. Only items that test recognition of 
quantities, numbers, and representations of numbers were kept. 
Figure 2 illustrates items on the CB version of the test. 

The FF and CB data collection proceeded the same way as in the 
sound discrimination test described before. 

Deductive reasoning test. Deductive reasoning was mea- 
sured with 32 open-ended, contextually embedded tasks in FF 
mode and with 32 multiple-choice tasks in CB mode to make 
automated scoring practicable and to allow immediate feedback 
after testing in the latter case. Each task began with two premises 
(statements), and children had to reach and formulate a logical 
conclusion. The context of the situations presented to them may 
have been familiar from everyday life, so it would have been 
possible for them to use real-world knowledge to formulate their 
conclusions. 

Inductive reasoning test. The structure of the inductive rea- 
soning test was based on Klauer’s (1989) definition of inductive 
reasoning. Klauer defined inductive reasoning as discovering reg- 
ularities by detecting similarities, dissimilarities, or a combination 
of both, with respect to attributes or relations to or between 
objects. This involved six classes in total (generalization, discrim- 
ination, cross-classification, recognizing relations, discriminating 
relations, and system formation). The test consisted of 37 figural, 
nonverbal items belonging to the six subclasses of inductive rea- 
soning described above. Figure 3 illustrates the items on the 
inductive reasoning test; the same pictures were used both in PP 
and CB modes. 

During the digitization of the test, all features of the items were 
preserved to make the two versions as similar as possible. For 
example, in the PP multiple-choice items, children had to circle or 
underline the letter or the picture, whereas in the CB format, they 
had to click on the same letter or picture to indicate their answer 
(see Figure 3). 


Procedure 


Two traditional delivery methods, FF and PP testing modes, 
were used. During FF administration, children were tested indi- 
vidually. The instructions for the items were read by test admin- 
istrators, most of whom were the children’s homeroom teachers, 
and children’s answers were recorded on a scoring sheet. The PP 
version of the inductive reasoning test was taken in the children’s 
regular classroom under the supervision of their class teachers. The 
scoring sheets of all tests were collected after the testing session, 
data were centrally processed, and no feedback was provided to the 
children. 

The online data collection was carried out via the eDia (Elec- 
tronic Diagnostic Assessment) platform through the Internet. Test- 
ing took place in the computer labs at the participating schools, 
using the available computers and browsers installed. A session 
lasted approximately 20 to 45 min, depending on the test. The items
were automatically scored, and children received immediate feed- 
back (percent of correct answers) at the end of the testing session. 
The eDia platform allows the use of proxy servers. 
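
The automated scoring and percent-correct feedback described above can be illustrated with a minimal sketch. It assumes dichotomously scored multiple-choice items and a fixed answer key; the function name `score_session`, the option labels, and the sample data are hypothetical and are not taken from the eDia platform.

```python
def score_session(responses, answer_key):
    """Score one child's session against an answer key.

    Returns a list of 0/1 item scores and the percentage of correct answers.
    A missing response (None) is scored as incorrect.
    """
    if not answer_key:
        raise ValueError("answer_key must contain at least one item")
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    # 1 if the clicked option matches the keyed option, otherwise 0
    item_scores = [1 if given == correct else 0
                   for given, correct in zip(responses, answer_key)]
    percent_correct = 100.0 * sum(item_scores) / len(answer_key)
    return item_scores, percent_correct


# Hypothetical four-item test: the child clicked B, C, C and skipped item 4.
answer_key = ["B", "A", "C", "C"]
responses = ["B", "C", "C", None]
item_scores, percent_correct = score_session(responses, answer_key)
print(item_scores, percent_correct)  # prints [1, 0, 1, 0] 50.0
```

Because the key is applied identically to every child, this kind of scoring removes the rater variation that can arise when teachers record answers by hand.
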
Figure 2. Sample item from the computer-based version of the counting
and basic numeracy skills test. [Click on the card with a drawing of only 
one thing on it.] 


Statistical Analyses 


In this article, we analyze the differences between paper-based, 
FF, and online tests. We not only examine whether tests presented 
in different modes are equivalent, but we also show where they are 
different, as one of the aims of this study was to support the design 
of better online instruments. To reach this goal, we applied several 
analyses, including computations based on classical test theory, 
confirmatory factor analyses within structural equation modeling 
(SEM; Bollen, 1989), to test the underlying measurement model 
and to test measurement invariance, and IRT. In this section, we 
only discuss the theoretical background of how SEM analyses 
were applied in the present study. 

Providing a meaningful interpretation of test scores and ensur- 
ing the comparability and validity of FF and CB test results is only 
possible if the structure of the construct does not change across 
delivery modes (Byrne & Stewart, 2006). That is, measurement 
invariance must be analyzed to examine whether test results are 
affected by the test medium and to ensure that the same constructs 
are being assessed in each group. If measurement invariance is 
sufficiently met, and, thus, structural stability exists, 
between-group differences can be interpreted as true differences in 
latent ability and not as psychometric differences (Greiff et al., 2013). 

A number of approaches, statistical methods, and concepts are 
available to test measurement equivalence (Schroeders & Wil- 
helm, 2011). State-of-the-art methods share a common feature: 
The definition of the measurement model is provided through a 
comparison of the latent structure for several groups in a single 
model. The most prominent methods are those used to detect 
differential item functioning within the IRT approach (Raju, Laf- 
fitte, & Byrne, 2002) and multigroup confirmatory factor analysis 
(MGCFA) (Bollen & Curran, 2006; Steenkamp & Baumgartner, 
1998; Vandenberg & Lance, 2000) within the SEM framework. 

In the present study, a between-subject design was used to test 
invariance by means of MGCFA. Weighted least squares, mean- 
and variance-adjusted (WLSMV) estimation was applied, and 
THETA parameterization was used because all items were scored 
dichotomously (Muthén & Muthén, 2010). All measurement models 
were computed with Mplus. Goodness of fit to the sample data 
was evaluated on the basis of multiple criteria. Different fit indices 
have been developed (Wu, Li, & Zumbo, 2007), and numerous 
cutoff criteria, such as Tucker-Lewis Index (TLI) and comparative 
fit index (CFI) values ≥ 0.90 or 0.95 and root-mean-square error of 
approximation (RMSEA) values ≤ 0.06 or 0.08, have been proposed to 
assist in determining model fit (see Byrne & Stewart, 2006; Fan & 
Sivo, 2005; Vandenberg & Lance, 2000). In this study, an absolute 
fit index (the RMSEA), a relative fit index (the TLI), and an 
incremental, normed fit index (the CFI) were used to evaluate 
model fit. Nested model comparisons were conducted using a 
special chi-square difference test for the WLSMV estimator 
(Muthén & Muthén, 2010). 
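
As an illustration of how such cutoff criteria can be applied, the following minimal sketch checks a set of fit indices against the lenient and the stricter conventions. It assumes the cutoffs are read as lower bounds for the CFI and TLI (.90 or .95) and upper bounds for the RMSEA (.08 or .06); the class and function names are hypothetical and are not part of Mplus or of the original analyses. The example values are those reported in the Results for the FF speech sound discrimination model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitIndices:
    cfi: float
    tli: float
    rmsea: float


def evaluate_fit(fit: FitIndices, strict: bool = False) -> dict[str, bool]:
    """Compare fit indices with conventional cutoffs (an assumed reading).

    strict=False: CFI and TLI >= .90, RMSEA <= .08
    strict=True:  CFI and TLI >= .95, RMSEA <= .06
    """
    incremental_cutoff = 0.95 if strict else 0.90
    rmsea_cutoff = 0.06 if strict else 0.08
    return {
        "cfi": fit.cfi >= incremental_cutoff,
        "tli": fit.tli >= incremental_cutoff,
        "rmsea": fit.rmsea <= rmsea_cutoff,
    }


# Fit reported for the FF speech sound discrimination model.
ff_speech = FitIndices(cfi=0.943, tli=0.940, rmsea=0.019)
lenient = evaluate_fit(ff_speech)
stricter = evaluate_fit(ff_speech, strict=True)
print("lenient cutoffs: ", lenient)
print("stricter cutoffs:", stricter)
```

Under these assumptions the model meets all three lenient criteria, whereas only the RMSEA meets the stricter ones, which is why fit is usually judged on several indices together.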

Testing for measurement invariance (Muthén & Muthén, 2010; 
Vandenberg & Lance, 2000) with categorical data involves a fixed 
sequence of model comparisons, testing different levels of invari- 
ance by comparing measurement models from the least to the most 
restrictive model by using MGCFAs. Measurement invariance 
exists if restrictions of model parameters in one model do not 
generate a substantially worse model fit in comparison to an 
unrestricted model. The procedure for testing measurement invari- 
ance is explained thoroughly by Byrne and Stewart (2006). 
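
The sequence of comparisons can be sketched as a simple decision rule. The function below is a hypothetical helper, not part of the original analyses: each flag states whether a difference test showed a significant loss of fit for the more restrictive model, and the function returns the most restrictive level that still holds. Partial strict invariance is left out to keep the sketch short.

```python
from typing import Optional


def highest_invariance_level(
    configural_fits: bool,
    strong_worse: bool,
    partial_strong_worse: Optional[bool],
    strict_worse: bool,
) -> Optional[str]:
    """Return the most restrictive invariance level that still holds.

    Each *_worse flag is True when the restricted model fit significantly
    worse than its comparison model. partial_strong_worse is None when
    partial strong invariance was not examined.
    """
    if not configural_fits:
        return None  # the basic model structure is not shared across groups
    if not strong_worse:
        # Strong invariance holds; check the additional strict constraints.
        return "strong" if strict_worse else "strict"
    if partial_strong_worse is False:
        # Freeing a few parameters restored acceptable fit.
        return "partial strong"
    return "configural"


# Flags set to mirror the outcomes described in the Results section.
speech = highest_invariance_level(True, True, False, True)
relational = highest_invariance_level(True, True, True, True)
inductive = highest_invariance_level(True, False, None, False)
print(speech, relational, inductive, sep=" | ")
```

Only levels of at least (partial) strong invariance permit a meaningful comparison of latent means, so a result of "configural" means that group means should not be compared.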

Configural invariance investigates whether the basic model 
structure is invariant across groups (Byrne, 2008), that is, whether 
children in the CB and FF environments conceptualize the con- 
struct in the same way (Milfont & Fischer, 2010) and thus use the 
same conceptual framework to answer the test items (Vandenberg 
& Lance, 2000; Wu, Li, & Zumbo, 2007). Configural invariance 
indicates that the same item is an indicator of the same latent factor in 
each group, but factor loadings may differ across groups. When we 
tested configural invariance with categorical outcomes, thresholds and 
factor loadings were not constrained across groups, factor means were 
fixed at 0 in all groups, and residual variances were fixed at 1 in all 
groups. The next step is to test weak factorial invariance, that is, to test 
cross-group equality in the loadings. However, testing it for categor- 
ical data is not recommended (Muthén & Muthén, 2010), and thus 
weak factorial invariance was not tested (see, e.g., Greiff et al., 2013; 
Schroeders & Wilhelm, 2011). 

Strong invariance, as the subsequent step in testing measurement 
invariance, indicates that the variances for latent variables 
and the covariances between the latent variables are equal between 
CB and FF modes; that is, cross-group equality exists in the 
loadings and intercepts. When this was tested, thresholds and 
factor loadings were constrained so that they would be equal 
across groups, and residual variances were fixed at 1 and factor 
means at 0 in the FF group, whereas there were no constraints 
specified in the CB group (Muthén & Muthén, 2010). If strong 
factorial invariance did not hold according to the modification 
indices, partial strong invariance was tested. Strong factorial 
invariance is the level at which latent mean comparisons can be 
conducted (Byrne & Stewart, 2006). 

Figure 3. Sample item from the computer-based version of the inductive 
reasoning test. [Click on the three shapes that have one thing in common 
that the other two do not.] 

Finally, strict factorial invariance indicates whether the CB and FF 
groups have the same item residual variances (Byrne, 2008). It re- 
quires cross-group equality in the loadings, intercepts, and residual 
variances. Therefore, in addition to the restrictions applied in strong 
factorial invariance, all residual variances were fixed at one in all 
groups, even though strict factorial invariance is not a prerequisite for 
media comparisons of latent factor means and variances. 
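
The constraints described for the three models can be collected in a small data structure, which makes it easy to see what each step adds. This is only a summary sketch of the settings stated above; the class, field, and function names are hypothetical and do not correspond to Mplus syntax.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class InvarianceModel:
    name: str
    loadings_equal: bool    # factor loadings constrained equal across groups
    thresholds_equal: bool  # item thresholds constrained equal across groups
    residual_var_ff: str    # residual variances in the FF (reference) group
    residual_var_cb: str    # residual variances in the CB group
    factor_mean_ff: str     # latent factor mean in the FF group
    factor_mean_cb: str     # latent factor mean in the CB group


CONFIGURAL = InvarianceModel(
    "configural", False, False, "fixed at 1", "fixed at 1", "fixed at 0", "fixed at 0"
)
STRONG = InvarianceModel(
    "strong", True, True, "fixed at 1", "free", "fixed at 0", "free"
)
STRICT = InvarianceModel(
    "strict", True, True, "fixed at 1", "fixed at 1", "fixed at 0", "free"
)


def changed_settings(a: InvarianceModel, b: InvarianceModel) -> dict[str, tuple]:
    """List the settings that differ between two models as (a, b) pairs."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    }


configural_to_strong = changed_settings(CONFIGURAL, STRONG)
strong_to_strict = changed_settings(STRONG, STRICT)
print(configural_to_strong)
print(strong_to_strict)
```

The only change from the strong to the strict model is that the residual variances in the CB group are fixed as well.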


Results 


The presentation of the results is organized according to the 
research questions. First, we examine measurement invariance for 
each of the five tests (Research Question 1). Second, as we see that 
in some cases the scales did not remain identical, we study the 
direction of the changes (Research Question 2). Finally, we exam- 
ine whether changing the testing media has the same impact on 
boys and girls (Research Question 3). 


Research Question 1: Examining the Media Effect 
Through Analyses of Measurement Invariance 


The measurement invariance analyses were performed as described 
in the Method section. The results are summarized in Table 2. 





Table 2 
Goodness-of-Fit Indices for Measurement Invariance of the Tests 

Test                          Model   χ²       df            CFI    TLI    RMSEA   Δχ²ᵃ    Δdfᵃ   p 
Speech sound discrimination   (1)     2905.7   2628          .908   .906   .017 
                              (2)     2950.2   2663          .905   .904   .017    60.0    35     <.05 
                              (2.1)   2935.9   2661          .909   .908   .017    44.8    33     >.05 
                              (3)     3231.4   2700          .824   .824   .024    287.0   72     <.05 
Relational reasoning          (1)     389.9    97            .971   .961   .051 
                              (2)     490.5    108           .962   .954   .055    89.2    11     <.01 
                              (2.1)   396.0    105           .971   .964   .049    27.4    8      <.01 
                              (3)     1027.1   119           .910   .900   .081    463.9   22     <.01 
Counting and numeracy         (1)     46.2     17            .996   .997   .038 
                              (2)     243.4    22            .984   .978   .093    128.2   5      <.01 
                              (2.1)   157.0    20            .990   .985   .076    70.8    3      <.01 
                              (3)     268.9    [illegible]   .983   .981   .087    169.6   10     <.01 
Deductive reasoning           (1)     185.8    136           .980   .973   .030 
                              (2)     293.1    146           .942   .927   .049    75.9    10     <.01 
                              (2.1)   264.4    142           .951   .938   .046    56.4    6      <.01 
                              (3)     348.1    166           .928   .921   .051    34.0    14     <.01 
Inductive reasoning           (1)     1791.2   908           .929   .923   .037 
                              (2)     1828.4   930           .924   .919   .038    42.0    22     >.01 
                              (2.1)   1806.4   931           .926   .921   .038    15      23     >.05 
                              (3)     1868.1   962           .921   .916   .039    55.0    32     >.05 

Note. Model: (1) = configural invariance; (2) = strong factorial invariance; (2.1) = partial strong factorial invariance; (3) = strict factorial invariance. 
CFI = comparative fit index; TLI = Tucker-Lewis Index; RMSEA = root-mean-square error of approximation. 
ᵃ Δχ² and Δdf were estimated with the Difference Test procedure (DIFFTEST) in Mplus. When using weighted least squares, mean- and variance-adjusted 
estimation, χ² differences between models cannot be compared by subtracting χ² and df (Muthén & Muthén, 2010). 

For the speech sound discrimination test, both the FF and the CB 
versions were administered to the same sample, and a multivariate, 
single-level approach made it possible to test measurement invariance 
in a single-group analysis. First, we tested the confirmatory factor 
analysis at each of the two points in time to be sure the model fit well 
in both modes. Examination of the modification indices suggested 
that model fit would be significantly improved by changing the 
original model. According to the results of the Lagrange multiplier 
test, we needed to delete 16 items from the analyses because of 
ceiling effects. The remaining 44 items fit the data in both modalities 
well (FF: RMSEA = .019, CFI = .943, TLI = .940; CB: RMSEA = 
.035, CFI = .915, TLI = .910). The strong factorial invariance model 
did not fit well and resulted in a significant decrease in fit relative to 
the configural invariance model. The examination of the modification 
indices suggested that model fit would be significantly improved by 
allowing the intercept for one item to differ between data collections 
and adding residual covariances between two items to the CB model. 
Partial strong invariance did hold. The observed differences in item 
means between FF and CB testing were due to factor mean differences 
(for one item, children in FF mode were expected to have a higher item 
response) and a residual covariance in CB mode (two items proved to 
be correlated). Finally, we tested partial strict invariance, resulting in 
a significant decrease in fit relative to the configural model; that is, the 
relations of the items to the latent factor of speech sound 
discrimination were not equivalent in FF and CB modes in a strict sense. 
However, strict factorial invariance is not a prerequisite for latent 
mean comparisons; in this case, partial strong factorial invariance is 
sufficient to compare latent means. 

The results regarding the strong factorial invariance model for 
relational reasoning indicated a significant decrease in fit relative 
to the configural invariance model. The modification indices 
suggested freeing the intercepts for two items between FF and CB 
testing. The partial strong invariance model in which the intercepts 
for these two items were allowed to differ between the FF and CB 
groups fit better than the strong factorial invariance model, but still 
significantly worse than the configural invariance model. This 
means no (partial) measurement invariance could be established 
for the relational reasoning test. 

For counting and basic numeracy, the strong factorial invariance 
model resulted in a significant decrease in fit relative to the 
configural invariance model as well. According to the Lagrange 
multiplier test, the intercepts for Items 6 and 7 were freed between 
the FF and CB groups. The partial strong invariance model fit 
better than the strong factorial invariance model, but still significantly 
worse than the configural invariance model. Thus, no measurement 
invariance could be established in this case either. 

The result for the invariance testing of deductive reasoning 
indicated a decrease in model fit for all levels of invariance; thus, 
there was no measurement or partial measurement invariance 
between the FF and CB testing modes for deductive reasoning. 
These data suggest that CB (online administration) does not 
measure exactly the same construct as FF delivery does. Consequently, 
mean differences between FF and CB groups cannot be interpreted 
as true differences in the underlying deductive reasoning 
construct; they could also be due to psychometric issues. One of the 
possible reasons for this is that information was more standardized 
in the CB environment; achievements in the CB environment were 
independent of the teacher’s attitude and judgment during data 
collection. This was not the case in the FF mode. 

The result for the invariance testing of inductive reasoning repre- 
sented no loss in model fit; even imposing the most restrictive con- 
straints did not lead to deterioration in model fit. The model of strong 
factorial invariance did not show a decrease in model fit compared 
with the model of configural invariance. Strict factorial invariance 
could also be established; that is, even residual variances proved to be 
equal across delivery media in a strict sense. 
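
The note to Table 2 states that, under WLSMV estimation, chi-square differences cannot be obtained by subtraction. A short check on the Table 2 values for the configural and strong factorial invariance models illustrates the point. The sketch only contrasts the naive differences with the reported DIFFTEST values; it does not reproduce the DIFFTEST computation itself, and the function name is hypothetical.

```python
# (test, chi-square configural, chi-square strong, reported DIFFTEST value), from Table 2
ROWS = [
    ("Speech sound discrimination", 2905.7, 2950.2, 60.0),
    ("Relational reasoning", 389.9, 490.5, 89.2),
    ("Counting and numeracy", 46.2, 243.4, 128.2),
    ("Deductive reasoning", 185.8, 293.1, 75.9),
    ("Inductive reasoning", 1791.2, 1828.4, 42.0),
]


def naive_difference(chi2_restricted: float, chi2_free: float) -> float:
    """Plain subtraction of two chi-square values (not valid under WLSMV)."""
    return round(chi2_restricted - chi2_free, 1)


for name, configural, strong, reported in ROWS:
    naive = naive_difference(strong, configural)
    print(f"{name:30s} naive = {naive:6.1f}   DIFFTEST = {reported:6.1f}")
```

In every row the naive difference departs from the reported value, sometimes upward and sometimes downward, so the adjusted test has to be requested from the software rather than computed by hand.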


Research Question 2: Differences Between the Tests 
Delivered in Different Media 


As shown in the previous section, depending on the content of the 
assessment and the item types, there are differences between the tests 
with respect to how the measurement scales changed when the items were 
transferred to the online platform. In this section, we compare the 
reliability of the tests delivered by the two media, examine the impact 
of the media on performance, and take a closer look at item-level 
differences by comparing the difficulties of the items in the two 
media. 

Reliability of the tests. The internal consistencies of the tests 
were examined by computing Cronbach’s alpha for each test. The 
DIFER tests were previously also administered to participants in 
the Hungarian Educational Longitudinal Program (5,000 < n < 
6,000), and the reliability indices of those two digitized in this 
study were high (relational reasoning: .726; counting and basic 
numeracy: .915), the relational reasoning test showing the lowest 
value (Csapó, 2007; Józsa, 2004). In the present study, the reliability 
indices were slightly different but in general also good both 
in FF/PP and CB modes. They ranged from .743 to .887 in the 
FF/PP mode and from .770 to .938 in the CB mode. Generally, the 
reliability indices of the CB tests proved to be somewhat higher 
than those of the FF/PP test versions (see Table 1). 
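
For readers who wish to reproduce this kind of reliability index, Cronbach's alpha can be computed from a matrix of dichotomous item scores as sketched below. The response matrix is invented for illustration and is not data from this study; the function name is hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(scores: list[list[int]]) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha requires at least two items")
    item_variances = [pvariance([row[i] for row in scores]) for i in range(n_items)]
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of six children to four dichotomous items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.3f}")
```

The same variance estimator is used for items and for total scores, so the choice between population and sample variance does not change the result.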

The reliability value was already high for the FF administration of 
speech sound discrimination (.887), and it improved further (to .938), 
being the highest within this set of tests. This improvement may be 
attributed to the standardized voice stimuli. There were slight im- 
provements in Cronbach’s alpha for relational reasoning and deduc- 
tive reasoning. A major drop of reliability was observed for the 
counting and basic numeracy test. Although several items were 
dropped from the FF version because they required oral responses and 
it was not possible to implement this in the CB version, the reduced 
FF test consisting of 13 items still had a relatively high reliability 
(.813). The PP version of the inductive reasoning test was digitized 
without major changes. This was reflected in the reliabilities as 
Cronbach’s alphas of the two versions did not differ. 

The impact of the assessment media on performance. A 
meaningful interpretation of differences in test scores is only 
possible if the structure of the construct measured does not change 
across test media (Byrne & Stewart, 2006). That is, latent (and 
manifest) mean comparison can only be interpreted meaningfully 
if at least strong factorial invariance is established (Brown, 2006). 
According to the measurement invariance analyses, testing for 
latent mean differences was only meaningful in the case of speech 
sound discrimination and inductive reasoning tests, where partial 
strong and strict measurement invariance held, respectively. 

Latent mean comparisons were conducted by constraining the 
item intercepts of the observed variables equal and setting the 
latent factor means for the FF (or PP) group as reference group to 
zero (Byrne & Stewart, 2006). We also calculated performance on 
tests in percentages, summarizing the results in Table 3. We report 
the means also for those tests for which measurement invariance 
could not be established. 

Table 3 
Test-Level Achievement Differences Between Traditional (FF or PP) and CB Modes 

                               FF or PP (%)      CB (%)                          Latent 
Test                           M       SD        M       SD      d             M             SE            p 
Speech sound discrimination    91.4    9.88      82.61   16.80   [illegible]   [illegible]   [illegible]   <.01 
Relational reasoning*          80.86   15.44     78.03   17.86   [illegible]   −.67          .11           <.01 
Counting and basic numeracy*   87.35   18.46     88.90   14.84   .09           [illegible]   .07           ns 
Deductive reasoning*           70.69   14.34     63.59   20.87   .39           [illegible]   .06           <.01 
Inductive reasoning            47.52   19.04     45.99   18.11   .08           [illegible]   .06           [illegible] 

Note. d is the difference between the means in standard deviation units (Cohen's d). Latent mean for the FF group was set to zero. FF = face-to-face; 
PP = paper and pencil; CB = computer based. 
* Measurement invariance does not hold. 

The data indicate that performance on the speech sound dis- 
crimination test was significantly lower in the CB tests than in the 
FF test: it fell from a high (91.4%) to a still high but significantly 
lower level (82.6%). A similar drop was found in the case of 
deductive reasoning and a modest drop for relational reasoning. No 
significant mean differences were observed for the 
counting and basic numeracy test or the inductive reasoning test. 
The latter represents a PP-CB transition, and no significant differ- 
ence was found between the two versions. 
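
The effect sizes in Table 3 can be approximated from the reported means and standard deviations. The sketch below assumes a pooled standard deviation that weights both modes equally; the article does not state which pooling formula was used, and group sizes are ignored, so results may differ from the published values in the second decimal. The function name is hypothetical.

```python
from math import sqrt


def cohens_d(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Standardized mean difference using an equally weighted pooled SD (an assumption)."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Means and SDs (percent correct) taken from Table 3: traditional mode first, CB second.
d_deductive = cohens_d(70.69, 14.34, 63.59, 20.87)
d_inductive = cohens_d(47.52, 19.04, 45.99, 18.11)
print(f"deductive reasoning: d = {d_deductive:.2f}")
print(f"inductive reasoning: d = {d_inductive:.2f}")
```

A positive value indicates higher performance in the traditional mode than in the CB mode.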

Item-level differences. A further way of analyzing the char- 
acteristics of CB testing is to have a look at the results at the item 
level. To do this, we computed the item difficulties for each item 
in both media. To illustrate the possibilities for this type of 
analysis, we present the results of the speech sound discrimination, 
deductive reasoning, and inductive reasoning tests. 

For the speech sound discrimination test, we once again took 
advantage of the two test versions being administered to the 
same sample and computed the item parameters on the basis of 
IRT scaling. We considered the entire item pool of the two 
versions of the test as items of a single test and calculated the 
item parameters. According to the analysis of the content, items 
that contained nonsense words proved to be most affected by FF 
administration. A possible reason for this difference may be that 
in the case of meaningless words, teachers helped children to 
find the correct answer. The same effect was observed for 
words where the lack of context could in part have been 
replaced by teachers’ helpful behavior, whereas the impact of 
FF administration was less apparent for words in complete 
sentences (only in the case of this latter test was a significant 
correlation found between the item difficulty parameters: r = 
.550, p < .01). These results also suggest that items presented 
by teachers lose their objectivity in some cases, especially if 
stimuli are taken out of their usual context. 

For the deductive reasoning tests, a much higher correlation was 
found between the item difficulties in the two media (r = .750, 
p < .01). The inductive reasoning test showed the most “regular” 
picture, in agreement with the previous observations. The items 
correlated to a very high degree (r = .948, p < .01), indicating that 
the PP and CB tests measure inductive reasoning skills very similarly, 
not only at the overall test level but also at the level of items. 
The inductive reasoning test is the only one where one of the versions 
was taken on paper. The pattern here matches what we have already 
shown concerning the reliability and the difference between 
the means: The two versions of the test behave very much alike, so 
they can essentially be considered identical. 
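
An item-level comparison of this kind can be sketched as follows, assuming that item difficulty is taken as the classical proportion of correct answers (the article does not specify the index used for every test). The response matrices are invented for illustration, and the function names are hypothetical.

```python
from math import sqrt


def item_difficulties(responses: list[list[int]]) -> list[float]:
    """Proportion of correct answers per item (classical item difficulty)."""
    n_children = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_children for i in range(n_items)]


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equally long lists of values."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(ss_x * ss_y)


# Hypothetical dichotomous responses (rows = children, columns = items).
pp_responses = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
cb_responses = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 1, 0]]

pp_difficulty = item_difficulties(pp_responses)
cb_difficulty = item_difficulties(cb_responses)
r = pearson_r(pp_difficulty, cb_difficulty)
print(pp_difficulty, cb_difficulty, round(r, 3))
```

A high correlation shows that items keep their relative difficulty across media, even if the overall level of performance shifts.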


Research Question 3: Gender Differences 


As previous studies have indicated, gender differences are influ- 
enced by the medium of testing. In general, boys perform somewhat 
better if they are assessed via technology-based instruments. In the 
previous FF versions of the DIFER tests, girls performed somewhat 
better on the speech sound discrimination test, whereas boys performed 
better on the counting and numeracy test (Józsa, 2004). 

To examine how transferring the DIFER tests and the inductive 
reasoning test to the online platform affected gender differences at 
this very young age, we carried out several analyses. First, we 
performed invariance analyses with regard to gender. 

The model used to test configural invariance of speech sound 
discrimination for boys and girls fit well. The model of strong 
factorial invariance did not show a decrease in model fit based on 
the stricter perspective (nonsignificant chi-square difference test; 
cf. Table 4) compared with the model of configural invariance. 
Finally, strict factorial invariance could not be established. As 
strong factorial invariance held (see Meredith, 1993), mean differ- 
ences could be interpreted as true differences in the construct being 
measured between girls and boys (Byrne & Stewart, 2006). 

In the computerized versions of the tests examined here, a 
significant gender difference was only found for the speech sound 
discrimination test: Girls performed somewhat better than boys (in 
%, girls: M = 86.37, SD = 12.76; boys: M = 80.40, SD = 18.03; 
t = −3.94, p < .001). No significant gender differences were 
found for the other tests. 


Discussion 


Measurement Invariance 


We found that measurement invariance held in two out of the 
five cases (if we consider the practical perspective and the strong 
invariance model sufficient) and, in a strict sense, only in one case, 
for the inductive reasoning test. This test was originally a PP test, 
and the migration of the items took place so that neither the general 
look of the test nor the item types were changed. 

Measurement invariance held only partially for the speech 
sound discrimination test. This was originally an individually 
administered FF test, and the same pictures were presented to the 
children in both modes. The item type and the scoring principle did 
not change either. This finding indicates that under certain condi- 
tions, even an FF test can be transferred to the online platform with 
acceptable results in terms of measurement invariance. This may 
also hold in part for the relational reasoning test. 

Measurement invariance did not hold for the three remaining 
tests. These results indicate that equivalent scales may only be 
constructed if the migration of the items does not change the item 
types and only changes the testing context moderately. If the 
migration influences the objectivity of scoring, the tests 
administered in the two media are not exactly identical. This happened on 
the deductive reasoning test as the open-ended items were 
converted to multiple-choice items. 

Table 4 
Goodness-of-Fit Indices for Measurement Invariance of the 
Speech Sound Discrimination Test for Boys and Girls 

Model   χ²            df            CFI           TLI           RMSEA         Δχ²ᵃ          Δdfᵃ          p 
(1)     [illegible]   2536          [illegible]   [illegible]   [illegible] 
(2)     2887.0        2580          .935          .934          .024          44.8          44            >.05 
(3)     [illegible]   [illegible]   [illegible]   [illegible]   .024          [illegible]   [illegible]   [illegible] 

Note. Model: (1) = configural invariance; (2) = strong factorial 
invariance; (3) = strict factorial invariance. CFI = comparative fit index; TLI = 
Tucker-Lewis Index; RMSEA = root-mean-square error of approximation. 
ᵃ Δχ² and Δdf were estimated with the Difference Test procedure (DIFFTEST) in 
Mplus. When using weighted least squares, mean- and variance-adjusted 
estimation, χ² differences between models cannot be compared by 
subtracting χ² and df (Muthén & Muthén, 2010). 

From an applied perspective, the scale differences may only cause 
problems if there is a need to compare performance within the two 
media. If the purpose of transferring the test is to construct a new, 
more applicable instrument, the lack of measurement invariance does 
not cause a problem. As the aim of the present study was to establish 
the development of a new instrument, analyzing further details of the 
differences where the measurement invariance does not hold may 
help to construct better tests. If the changes may be attributed to the 
improvement of the instrument, the lack of invariance may even be 
favorable, but in this case, the new instruments can only be used in 
practice after careful piloting and validation processes. 


The Impact of Media on Reliability and 
Achievement Scores 


The study has shown that the reliability of assessments may be 
improved by transferring individually administered FF instruments 
to an online platform. Having a look at the reliability coefficients 
of the five instruments in two modes, we may observe that the 
reliability increased in those cases when CB assessment provided 
more standardized conditions compared with FF administration. 

The results indicated that performance was lower on the CB 
tests than on the FF or PP tests in most cases. This suggests that 
teachers tend to give higher scores to children than the scores they 
receive when their responses are automatically (and objectively) 
scored. Another explanation for the differences might be that 
children had difficulty handling the computerized tests, and this 
lowered their performance. However, as no significant mean dif- 
ferences were found for the counting and basic numeracy test or 
the inductive reasoning test, low computer familiarity is not a 
sufficient explanation for achievement differences. In fact, a more 
realistic explanation may be that teachers were more tolerant in 
accepting children’s responses. 


The Impact of Media on Gender Differences 


Due to the limitation of the available data, gender differences were 
explored by latent analyses only for the speech sound discrimination 
test. The results indicated that measurement invariance did hold; thus, 
FF and CB testing was the same for boys and girls, and latent means 
could be compared. Transferring the test to the new medium affected 
their performance similarly. A comparison of the raw scores indicated 
that girls’ performance was somewhat better, similarly to former FF 
assessments, where girls usually performed slightly better than boys. 
Together, these data may indicate that using the new medium will not 
cause a major bias in future applications. 


Limitations of the Present Study 


Due to the context of the study (using a system that is still under 
development), there were smaller sample sizes available compared 
with previous large-scale assessments. The availability of computers 
at school and the time first-graders were available to work on 
the computers limited the possible complexity of the study. The 
analyses are also constrained by the unavailability of additional 
background variables. In this phase of the research, we have no 
data concerning the possibilities of using the same instrument for 
repeated testing to monitor development. (The previous FF version 
of the DIFER is routinely used for this purpose without problems.) 
There are no data on the predictive validity of the CB instruments 
either. However, as indicated earlier, the FF version was 
administered in a longitudinal study and proved to be a good predictor of 
later school performance. 

A further limitation of the present study is that the original 
DIFER tests were designed to assess children in the kindergarten- 
to-school transition period. Thus, students who have already 
started school tend to perform close to ceiling on these tests, and 
their data are not ideal for analyzing the characteristics of the tests. 
This deficiency may be rectified by extending future investigations 
to kindergarten populations, although the use of computers with 
that age group calls for further feasibility studies. 


Conclusions and Further Prospects for Online 
Assessment of School Readiness 


This study shows the potential and limitations of transferring 
school readiness tests to the new assessment medium of computers. In 
the digitization process, we have lost a strong and important test with 
high reliability and predictive validity as the social skills test cannot 
be replicated in the new medium while remaining close to its original 
form. We have also lost the writing test, but a closely related construct 
(fine hand movement) can easily be measured using emerging tech- 
nologies. In fact, an alternative construct (handling keyboard and 
mouse) can also be easily measured by computer. The relevant tech- 
nology is at hand for research purposes, and it will probably also be 
widely available in schools. We have also lost a large number of 
relevant items on the counting and basic numeracy skills test, as it was 
not possible to capture children’s oral responses, and we have paid for 
this loss with a drop in reliability. Further efforts are therefore needed 
to develop a suitable CB counting skills test. 

The study has demonstrated the applicability of technology- 
based assessment in regular school practice at the earliest possible 
point of schooling in a number of highly relevant competency 
domains. These assessments can be carried out practically any- 
time, at very low cost, and with almost no extra teacher time. 
However, devising and using such instruments require further 
research in at least three dimensions: (a) making constructs cur- 
rently assessed with traditional instruments measurable using com- 
puter technology and assessing new constructs that are especially 
well suited to CB assessment; (b) enhancing the online assessment 
technology with functionalities that are already in use in other 
areas of information technology (e.g., speech recognition and 
detection of emotions); and (c) exploring ways of integrating 
frequent early assessment into educational processes. 

A number of technological solutions that can be built into the 
online assessment system to enhance its capabilities already exist 
elsewhere. Interaction, simulation, and manipulation of objects on 
screen, and new types of stimuli, such as video and animation, are 
currently in use in some assessments. It is also possible to time the 
stimuli and control the presentation of information in other ways. 
Measuring response time, logging keystrokes, and mouse move- 
ment can also routinely be used, although further studies are 
required to explore how these methods may contribute to solving 
the real problems of early assessment. One example of a real 
problem, where an existing technological solution may be essential, 
emerges from the present study: Voice recognition technology 
is needed to make the counting test deliverable online. 

Further research is needed to explore the educational applica- 
bility of online assessment. Ecological validity is an issue that 
requires careful consideration. Examining predictive validity is 
crucial for tests that assess the preconditions of further learning 
and are used to identify early indicators of later problems. An 
examination of the online tests discussed here has already started 
as an extension of the present study. An exploration of the effects 
of repeated testing has also begun, but the accumulation of a 
sufficient quantity of data for analyses will take time in both cases. 
As learners’ assessment results can easily be stored in the online 
assessment system, it is possible to gather not only overall perfor- 
mance data but also information collected on behavioral processes. 
In addition, the growing information base facilitates adequate 
monitoring of learners’ development. 

In the past decade, the issue of early development has been 
approached not only by researchers in numerous fields of study in 
education and psychology but also by those in other social sci- 
ences, such as sociology and economics. Results from these com- 
prehensive studies have indicated that numerous problems that 
arise later are rooted in difficulties in the first school years. 
Research has also shown that these difficulties may be overcome 
with adequate intervention and that investing in such programs 
produces high returns. An important component of the well- 
prepared and well-timed intervention is proper diagnosis. To this 
end, CB assessment of the basic skills may be one of the best 
means to diagnose problems and monitor development. 


Toward Automated Computer-Based Visualization and Assessment of
Team-Based Performance


Dirk Ifenthaler 


Deakin University 


A considerable amount of research has been undertaken to provide insights into the valid assessment of 
team performance. However, in many settings, manual and therefore labor-intensive assessment instru- 
ments for team performance have limitations. Therefore, automated assessment instruments enable more 
flexible and detailed insights into the complex processes influencing team performance. The central 
objective of this study was to advance knowledge in automated assessment of team-based performance 
using a language-oriented approach. Fifty-six teams of learners (N = 224) in 3 experimental conditions
solved 2 tasks in an online learning environment. They were analyzed with the Automated Knowledge 
Visualization and Assessment (AKOVIA) methodology. AKOVIA integrates a natural language-oriented 
algorithm and enables a structural and semantic compression of individual- and team-based knowledge 
representations. Findings indicate initial evidence of the feasibility and validity of the fully automated 
methodology. A framework for integrating research and methodology development is suggested for 
improving educational technology innovations such as computer-based assessment environments in
international large-scale assessments.


Keywords: team, shared mental model, automated assessment, natural language processing 


Teams are a critical and essential part of most working envi- 
ronments because they combine different views, multiple skills, 
diverse experiences, analytical judgments, and rich knowledge. 
Consequently, research in teams and their assessment has been a 
continuous endeavor in various scientific areas for more than 30 
years. Yet, there exist various definitions of team using different 
perspectives. For example, Kanaga and Kossler (2011) defined a 
team as “a specific kind of group whose members are collectively 
accountable for achieving the team’s goals” (p. 4). A more detailed 
definition is given by Katzenbach and Smith (2003), who de- 
scribed a team as “a small number of people with complementary 
skills who are committed to a common purpose, performance 
goals, and approach for which they are mutually accountable” (p. 
45). From an operational point of view, Cohen, Levesque, and 
Smith (1997) defined a team as “a set of agents having a shared 
objective and a shared mental state” (p. 95). Salas, Dickinson, 
Converse, and Tannenbaum (1992) described a team as 


a distinguishable set of two or more people who interact dynamically, 
interdependently, and adaptively toward a common and valued goal, 
who have each been assigned specific roles or functions to perform 
and who have a limited life span of membership. (p. 4) 


To sum up, common characteristics of definitions of a team in- 
clude at least two individuals, common objectives, shared respon- 
sibility and interdependence, as well as optimal performance. 

Instruments for measuring team performance have been developed
over the past decades; however, adequate computer-based
assessments of team-based performance are scarce (Fischer &
Mandl, 2005). The recent advancement of web-based technology
has widened the scope of computer-based assessments
(Csapó, Ainley, Bennett, Latour, & Law, 2012; Frey & Hartig,
2013). For example, international large-scale assessments such as 
the Programme for International Student Assessment (PISA) or the 
Programme for the International Assessment of Adult Competen- 
cies (PIAAC) currently implement advanced computer-based as- 
sessment environments (Organisation for Economic Co-operation 
and Development [OECD], 2010, 2013). 

Previously, most of the team-based assessment instruments re- 
quired a great deal of time and effort using highly trained research- 
ers (e.g., think-aloud protocol analysis) and were mainly limited to 
subjective self-reports (Wildman et al., 2012), and they also
required labor-intensive manual analysis of performance indicators
(Almond, Steinberg, & Mislevy, 2002). As a result, such assess- 
ments have been limited to the scientific community and have had 
only a minor impact on practical issues such as the design of 
effective learning, teaching, and working environments. The desire
for practical assessment instruments that are useful and valid has
led researchers to significant developments in the last several
years (Chung, O’Neil, & Herl, 1999; Mandl & Fischer, 2000). In
particular, instruments using graphical representations for
computer-based assessment have been successfully tested and
implemented, such as the DEEP methodology (Spector & Koszalka,
2004), KU-Mapper (Taricani & Clariana, 2006), and knowledge
mapping tools (Herl, O’Neil, Chung, & Schacter, 1999; O’Neil,
Chuang, & Baker, 2010). However, only a few of these
instruments have been fully automated and tested for reliability 
and validity. Furthermore, automated and language-oriented
assessment methodologies that enable a domain-independent analysis
without a reference to large text corpora are scarce (Clariana,
2010).

There were three aims in the present study: (a) to introduce a 
language-oriented approach toward automated computer-based as- 
sessment and visualization of team-based performance that can be 
applied in educational large-scale assessments; (b) to investigate 
the feasibility and validity of the automated computer-based
assessment and visualization methodology focusing on team
performance as a specific cross-curricular skill; and (c) to suggest
a framework for integrating research and methodology development for
the implementation of innovative computer-based assessment
environments for international large-scale assessments (e.g., PISA,
PIAAC).


Team-Based Performance 


A successful team typically possesses an informational advantage
over individuals (Mesmer-Magnus & DeChurch, 2009). However,
not all teams are able to take full advantage of these benefits.
Some teams may even fail despite this advantage. Hence, there 
have been numerous attempts to identify the specific factors that 
make a team successful (Cannon-Bowers, Salas, & Converse, 
1993; Guzzo & Dickson, 1996; Katzenbach & Smith, 1993; Sikorski,
Johnson, & Ruscher, 2012; Van den Bossche, Gijselaers,
Segers, Woltjer, & Kirschner, 2011). For example, empirical re- 
search shows that through the use of combined resources, teams 
can successfully handle tasks and problems that otherwise would 
be too complex for a single individual (Badke-Schaub, Neumann, 
& Lauche, 2011; Bierhals, Schuster, Kohler, & Badke-Schaub, 
2007; Cannon-Bowers & Salas, 2001; Cooke, Salas, Kiekel, & 
Bell, 2004; Eccles & Tenenbaum, 2004; Salas, Cooke, & Rosen, 
2008). 

Overall, shared mental models are regarded as a significant 
factor for successful team performance (Bandura, 1977, 1986; 
Cannon-Bowers et al., 1993; Cooke, Salas, Cannon-Bowers, & 
Stout, 2000; Mathieu, Heffner, Goodwin, Salas, & Cannon- 
Bowers, 2000; Van den Bossche et al., 2011). However, the 
concept of shared mental model is used and interpreted differently 
by various disciplines, for example, industrial/organizational psy- 
chology, human factors, social psychology, or system dynamics 
(Cooke et al., 2004). Within cognitive and educational psychology, 
the term shared mental model (SMM) is based on the theory of 
mental models (Johnson-Laird, 1983) and reflects internal repre- 
sentations that individuals construct to make sense of experiences 
with the world (Wittgenstein, 1922). Hence, individuals construct 
mental models in order to understand and explain experiences and 
events, process information, and solve complex problems (Gentner 
& Stevens, 1983; Johnson-Laird, 1989). More precisely, the theory 
of mental models is based on the assumption that cognitive pro- 
cessing takes place in the use of mental representations in which 
individuals organize symbols or representations of experience or 
thought in such a way that they effect a systematic representation 
of this experience or thought as a means of understanding or 
explaining it to others (Johnson-Laird, 1983). Hence, in order to 
create subjective plausibility, individuals construct an internal 
model that both integrates the relevant semantic knowledge and 
meets the perceived requirements of the situation (Ifenthaler & 
Seel, 2013). This internal model is referred to as an individual 
mental model (IMM). An SMM is denoted as a shared representation
of a team that includes overlapping domain and task knowledge,
skills, attitudes, objectives, processes, components,
communication, coordination, adaption, roles, relationships, behavior
patterns, and interactions (Bandura, 1986; Cooke et al., 2004;
Klimoski & Mohammed, 1994; Mohammed & Dumville, 2001).

Previous research shows that if team members share similar 
IMMs, they are more effective in their teamwork and perform 
better (Burke, Fiore, & Salas, 2003; Cannon-Bowers & Salas, 
2001; Marks, Zaccaro, & Mathieu, 2000; Salas et al., 1992; Van 
den Bossche et al., 2011). For example, Lim and Klein (2006) 
found that shared task knowledge and shared team knowledge 
were valid predictors for team performance. Similar results regard- 
ing the influence of shared task and team knowledge on team 
performance were found in a series of studies using flight simu- 
lators in laboratory settings (Mathieu, Heffner, Goodwin, Cannon- 
Bowers, & Salas, 2005; Mathieu et al., 2000). Findings of a 
meta-analysis performed by Salas et al. (2008) suggest that team 
processes and cognitive as well as affective dispositions moderate 
performance outcomes of teams. 

Although previous research highlighted different operationaliza- 
tions of SMMs (Akkerman et al., 2007; Cooke et al., 2004), this 
study is based on an extended cognitive perspective of SMM. 
Figure 1 illustrates the interaction of IMM and SMM as well as its
influence on team processes and team performance. The IMM of 
each team member integrates complex knowledge structures on 
declarative, procedural, causal, and metacognitive levels (Jonas- 
sen, Beissner, & Yacci, 1993; Kant, 1781/1998). The overlap of 
the IMMs constitutes the SMM. Cannon-Bowers and Salas (2001) 
identified two major components of an SMM: task-related com- 
ponents and team-related components. As every team member 
shares a certain amount of those components, it is therefore pos- 
sible for a team to develop a collective understanding of tasks, 
conditions, and requirements that are needed to cope with a prob- 
lem to be solved. However, this overlap is a result of complex 
interrelations between individual declarative, procedural, causal, 
and metacognitive knowledge as well as shared task and team- 
related knowledge (Cannon-Bowers & Salas, 2001). Team pro- 
cesses describe the transformation of all inputs through social 
interaction among team members into results, such as critical 
perspectives, new ideas, conflicts, decisions, or material objects. 
Finally, the result of all actions reflects the team performance 
(Bierhals et al., 2007; Mathieu et al., 2000). 


Automated Computer-Based Visualization and 
Assessment Methodology 


Clearly, the direct assessment of an IMM or SMM is not
possible (Jonassen & Cho, 2008; Miyake, 1986). Therefore, two
classes of functions that describe the complex processes and
interrelations between internal and external representations of
mental models need to be considered (Ifenthaler, 2010c): (a) $f_{in}$
as the function for the internal representation of objects of the
world (internalization) and (b) $f_{out}$ as the function for the
external rerepresentation back to the world (externalization).
Neither of these functions is directly observable (Strasser, 2010).
Thus, the assessment of IMMs and SMMs requires precise theoretical
understanding and in-depth empirical investigations. Moreover, the
possibilities of externalization for valid assessments are limited
to a few sets of sign and symbol systems, such as graphical or
language-based approaches (Seel, 1999). Further, Minsky (1981)
emphasized that different types of knowledge require different
types of representations. Helbig (2006) argued that individuals are
able to use different forms of representation of memorized
information. They can either recall an appropriate form of
representation from memory or transform memorized information into
an appropriate form of representation depending on situational
demands (Markman, 1999).

Figure 1. Interaction of IMM and SMM and its influence on team processes
and team performance.

Because it is not possible to assess directly the internal 
representations, it is necessary to identify economic, fast, reli- 
able, and valid methodologies to elicit and analyze externalized 
representations of IMMs and SMMs (Johnson, Ifenthaler, 
Pirnay-Dummer, & Spector, 2009). Despite recent advances in 
educational technology, empirical research and application of 
fully automated assessment methodologies in complex domains 
(Greiff, Wüstenberg, Molnár, et al., 2013) and IMMs and
SMMs are very limited (Carley, 1997; Fischer, Bruhn, Gräsel,
& Mandl, 2002; Fischer & Mandl, 2005; Mohammed, Klimoski,
& Rentsch, 2000).

The newly developed and fully automated Automated 
Knowledge Visualization and Assessment (AKOVIA) method- 
ology is based on mental model theory (Johnson-Laird, 1989) 
and integrates a large number of dynamic interfaces to different 
online environments, for instance, learning management sys- 
tems, personalized learning environments, game-based environ- 
ments, or computer-based assessment environments such as 
PISA or PIAAC. This open architecture of AKOVIA enables a 
large variety of research and practical applications, such as 
investigation of learning processes; distinguishing features of 
subject domains; cross-curricular, nonroutine, dynamic, and 
complex skills; or convergence of team-based knowledge.


Analysis Algorithm


The underlying assumption of AKOVIA is that IMMs and 
SMMs can be externalized and rerepresented as a graph (Rumel- 
hart & Norman, 1978). A graph consists of a set of vertices whose 
relations are represented by a set of edges (Tittmann, 2010). 
Various measures from graph theory have been successfully ap- 
plied in previous studies in order to analyze externalized IMMs 
and SMMs and their development over time within different 
environments and domains (Ifenthaler, Masduki, & Seel, 2011; 
O’Neil et al., 2010; Schvaneveldt, 1990). However, most of these 
studies used graphical rerepresentations (e.g., knowledge map, 
causal map) to assess the externalized IMM or SMM (Chung et al., 
1999). 

Using a natural language-oriented approach limits the bias of 
externalization (e.g., through causal maps, which require extensive 
training), as language is regarded as the most automated and direct 
form of externalization (Chomsky, 1970; Searle & Grewendorf, 
2002). Given the latest developments in educational technology, 
language-based approaches have been developed, such as text 
classification and machine learning methodologies, which allow 
for the automatic processing and analyzing of texts in various 
forms (Gweon, Rosé, Wittwer, & Nueckles, 2005). 

AKOVIA integrates such a language-oriented methodology. It 
follows the axiom on association and sequences: What is closely 
related is also closely externalized (Frazier, 1999; Pollio, 1966). 
The methodology relies on the dependence of syntax and seman- 
tics within natural language and uses the associative features of 
text as a heuristic to rerepresent knowledge. Unlike approaches 
from latent semantic analysis (Foltz, Kintsch, & Landauer, 1998), 
Web ontologies and semantic Web (Ding, 2001), the language- 
oriented approach of AKOVIA can operate on a comparably small 
amount of text (approximately 300 words) and does not require a 
reference to large (web-based) text corpora. To support
cross-curricular applications, AKOVIA operates domain independently
on the text input into the system.
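
The proximity heuristic described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the association between two words is approximated by the number of words separating them within a sentence, that a pair receives a default length (longest sentence plus one) for every sentence in which it does not co-occur, and that lower sums indicate stronger association. The function names and the naive tokenizer are hypothetical, tagging and stemming are omitted, and the sketch is not AKOVIA's actual implementation.

```python
import re
from itertools import combinations


def sentence_tokens(text):
    """Split text into sentences and lowercase word tokens (naive tokenizer)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [re.findall(r"[a-z]+", s.lower()) for s in sentences]


def min_gap(tokens, w1, w2):
    """Fewest words lying between w1 and w2 in one sentence, or None."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    if not pos1 or not pos2:
        return None
    return min(abs(i - j) - 1 for i in pos1 for j in pos2)


def association_sums(text):
    """Return (pair, summed distance) tuples, strongest association first.

    Assumption: sentences in which a pair does not co-occur contribute
    the default length (longest sentence + 1) to that pair's sum.
    """
    sentences = sentence_tokens(text)
    if not sentences:
        return []
    default_length = max(len(s) for s in sentences) + 1
    vocabulary = sorted({t for s in sentences for t in s})
    sums = {}
    for w1, w2 in combinations(vocabulary, 2):
        total, co_occurs = 0, False
        for tokens in sentences:
            gap = min_gap(tokens, w1, w2)
            if gap is None:
                total += default_length
            else:
                total += gap
                co_occurs = True
        if co_occurs:  # keep only pairs with association evidence
            sums[(w1, w2)] = total
    return sorted(sums.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for pair, distance in association_sums("Teams share knowledge. Teams solve tasks."):
        print(pair, distance)
```

Pairs that never appear in the same sentence are dropped, so only pairs with some evidence of association remain in the ranked list.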

AKOVIA’s language-oriented analysis is carried out in multiple 
stages (Ifenthaler & Pirnay-Dummer, 2014): 


1. Text input and cleaning is where the text is taken into the 
system (upload function through a web browser or via 
database interface) and checked for a specified character 
set, including the deletion of metadata such as HTML 
tags. 


2. Text parsing, stemming, and calculation of word associ- 
ations is where the text is split into sentences and single 
tokens such as words, punctuation marks, and quotation 
marks. Through a rule-based and corpus-based tagging 
process, nouns and names are identified within the text 
(Brill, 1995). Next, the stemming process reduces all 
words to their word stems as different inflections of a 
word need to be treated as one in the further analysis 
process (e.g., singular and plural forms such as book and 
books). Then, the associatedness is calculated by (a) 
identifying the default length of sentences, that is, the 
longest sentence in the text plus one and counting the 
number of words for each individual sentence; (b) iden- 
tifying all possible pairs of words; (c) calculating the 
distance between words of all pairs within all sentences, 
that is, the minimum number of words between the words 
of the pair in a single sentence; (d) calculating the sum of 
distances for the text for all pairs; (e) building a hierarchy 
of distances for all pairs; (f) final output generation 
including the lowest sum of distances, that is, only pairs 
with association evidence in the text are included. 


3. The graph-based analyses include seven measures that
enable a quantitative description of structural and semantic
features of the text (see Table 1). The structural and
semantic comparisons identify similarities of frequencies
or sets of properties between texts, for example, expert
text versus novice text (Goldsmith & Davenport, 1990).
The quantitative measures (see Table 1) are defined between
s = 0 (complete exclusion) and s = 1 (complete
identity); 0 ≤ s ≤ 1 (Tversky, 1977).


4. The graphical output is realized with the help of the
open-source graph visualization software GraphViz (Ellson,
Gansner, Koutsofios, North, & Woodhull, 2003).
GraphViz uses the list form of the hierarchy of distances
for all pairs and constructs a vertex-edge-vertex representation
of the most frequent pairs (see Figure 2). Each
vertex of the graphical output contains a destemmed
word of the pairs. The edge of the graphical output
contains an indicator for the noun-distance generated
from the text (Pirnay-Dummer & Ifenthaler, 2011). Additionally,
different colors of the edges indicate the
strength of association. The graphical output can also be
generated without the indicators on the edges.


The automated language-oriented analysis can be applied domain
independently for written texts (e.g., essay text) or graphical
representations (e.g., causal map, concept map) against a single or
multiple reference models (Coronges, Stacy, & Valente, 2007).
The reference model can be either the same individual’s prior
understanding of a phenomenon in question, another team member’s
understanding, a shared or aggregated understanding of the
phenomenon, or an expert solution of the phenomenon in question.
Cross-comparisons between written texts and graphical representations
are also possible (Ifenthaler, 2011). For team-based assessment,
the aggregate function allows the grouping of individual
representations into an aggregated representation (Pirnay-Dummer
& Ifenthaler, 2010).


Table 1
Description of the Seven AKOVIA Measures

Measure [abbreviation] and type       Short description

Surface matching [SFM]                The surface matching compares the number of vertices within
Structural indicator                  two graphs. It is a simple and easy way to calculate values
                                      for surface complexity.

Graphical matching [GRM]              The graphical matching compares the diameters of the spanning
Structural indicator                  trees of the graphs, which is an indicator for the range of
                                      conceptual knowledge. It corresponds to structural matching
                                      as it is also a measure for structural complexity only.

Structural matching [STM]             The structural matching compares the complete structures of
Structural indicator                  two graphs without regard to their content. This measure is
                                      necessary for all hypotheses that make assumptions about
                                      general features of structure (e.g., assumptions stating
                                      that expert knowledge is structured differently from novice
                                      knowledge).

Gamma matching [GAM]                  The gamma matching describes the quotient of terms per vertex
Structural indicator                  within a graph. Because both graphs that connect every term
                                      with each other term (everything with everything) and graphs
                                      that only connect pairs of terms can be considered weak
                                      models, a medium density is expected for most good working
                                      models.

Concept matching [CCM]                Concept matching compares the sets of concepts within a graph
Semantic indicator                    to determine the use of terms. This measure is especially
                                      important for different groups that operate in the same
                                      domain (e.g., use the same textbook). It determines
                                      differences in language use between the models.

Propositional matching [PPM]          The propositional matching value compares only fully
Semantic indicator                    identical propositions between two graphs. It is a good
                                      measure for quantifying semantic similarity between two
                                      graphs.

Balanced semantic matching [BSM]      The balanced semantic matching is the quotient of
Semantic indicator                    propositional matching and concept matching. In specific
                                      cases (e.g., when focusing on complex causal relations),
                                      balanced semantic matching could be preferred over
                                      propositional matching.

Note. AKOVIA = Automated Knowledge Visualization and Assessment.


Figure 2. Automated Knowledge Visualization and Assessment standardized graphical output. The association
strength between concepts is displayed by the numbers on the links. The value outside the brackets shows the
weight from the list form; the second value inside the bracket displays the weight relative to what is actually
visualized.
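
The comparison of a learner's graph with a reference model can be sketched as follows. The sketch assumes, as an illustration only, that frequency comparisons use 1 - |a - b| / max(a, b) and that set comparisons use Tversky's (1977) ratio model with equal weights of 0.5. The text above states only that the measures range from 0 to 1, so these formulas, the edge-set encoding of a graph, and all function names are assumptions. Only the surface, concept, propositional, and balanced semantic measures from Table 1 are covered.

```python
def build_graph(pairs):
    """Encode a graph as a set of undirected propositions (term pairs)."""
    return {frozenset(pair) for pair in pairs}


def concepts(graph):
    """All terms (vertices) that occur in at least one proposition."""
    return {term for edge in graph for term in edge}


def numeric_similarity(a, b):
    """Similarity of two frequencies in [0, 1] (assumed formula)."""
    largest = max(a, b)
    return 1.0 if largest == 0 else 1.0 - abs(a - b) / largest


def tversky_similarity(a, b, alpha=0.5, beta=0.5):
    """Tversky ratio model for two sets; alpha and beta are assumed weights."""
    common = len(a & b)
    denominator = common + alpha * len(a - b) + beta * len(b - a)
    return 1.0 if denominator == 0 else common / denominator


def compare(graph, reference):
    """Return a subset of the Table 1 measures for graph vs. reference."""
    sfm = numeric_similarity(len(concepts(graph)), len(concepts(reference)))
    ccm = tversky_similarity(concepts(graph), concepts(reference))
    ppm = tversky_similarity(graph, reference)
    # Balanced semantic matching: quotient of PPM and CCM (0 if no shared concepts).
    bsm = ppm / ccm if ccm > 0 else 0.0
    return {"SFM": sfm, "CCM": ccm, "PPM": ppm, "BSM": bsm}


if __name__ == "__main__":
    learner = build_graph([("team", "goal"), ("team", "member"), ("goal", "task")])
    expert = build_graph([("team", "goal"), ("team", "member"),
                          ("member", "role"), ("role", "task")])
    for name, value in compare(learner, expert).items():
        print(f"{name}: {value:.3f}")
```

The reference graph could equally be a team member's graph or an aggregated team graph, which is how the same comparison would apply to team-based assessment.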


Test Quality and Application 


The underlying analysis algorithms of AKOVIA have been 
successfully tested for reliability and validity in various experimental
studies focusing on IMMs (Al-Diban & Ifenthaler, 2011;
Johnson et al., 2011; McKeown, 2009; Pirnay-Dummer, Ifenthaler, 
& Spector, 2010). Reliability scores exist for single measures 
integrated into AKOVIA. They range from r = .79 to .94 and are 
tested for the structural and semantic measures separately and 
across different domains. Validity scores are also reported sepa- 
rately for the structural and semantic measures (convergent valid- 
ity r = .71 to r = .91) (Pirnay-Dummer & Ifenthaler, 2010). 

Kim (2012) as well as Al-Diban and Ifenthaler (2011) con- 
ducted cross-validation studies in order to identify the test quality 
of the underlying AKOVIA analysis algorithms. Both studies 
identified acceptable robustness of the automatically generated 
results when compared with traditional manual analysis proce- 
dures such as qualitative content analysis. 

In a recent longitudinal study using the AKOVIA methodology, 
an in-depth hierarchical linear modeling analysis revealed patterns
of the learning-dependent progression of IMMs on structural and
semantic levels (Ifenthaler et al., 2011). In a series of experimental 
studies, the effectiveness of preflective (Ifenthaler & Lehmann, 
2012) and reflective (Ifenthaler, 2012) prompts for self-regulated 
learning within problem-solving processes were successfully in- 
vestigated with the AKOVIA methodology. Another experimental 
study compared domain-distinguishing features of IMMs using the 
structural and semantic comparison functions of AKOVIA (Ifenthaler,
2011). The results showed unique features of the biology,
history, and mathematics domains. The AKOVIA methodology 
was also applied in order to compare unique features of written 
essays and causal representations (Johnson et al., 2011). Johnson et 
al. (2011) identified in their study significant differences of struc- 
tural properties and semantic content of written texts and causal 
representations when produced by the same learner within identical
domains. Pirnay-Dummer and Ifenthaler (2011) applied the
AKOVIA methodology for automatically generating feedback 
models on the fly. They found that the automatically generated 
feedback models had identical impact on problem solving when 
compared with feedback models generated by domain experts. 
Another series of experimental studies investigated variations of 
automated feedback models (standardized graphical output) gen- 
erated with AKOVIA (Ifenthaler, 2009, 2010a). These studies 
highlight the benefits of automated feedback models for online 
learning environments and self-regulated learning because they
can offer scaffolding and feedback whenever the learner re- 
quires it. 


Research Questions and Hypotheses 


The central research objective was to advance knowledge in 
automated and language-oriented computer-based assessment of 
team-based performance that can be applied in educational large- 
scale assessments. Therefore, this study was designed to investi- 
gate the feasibility and validity of the AKOVIA assessment and 
visualization methodology focusing on team performance as a 
specific cross-curricular skill. Specifically, we tested (a) the un- 
derlying analysis algorithm assuming that the structural and se- 
mantic measures of AKOVIA precisely identify similarities and 
differences in team-based performance. Further, we tested (b) 
whether the team-related knowledge was related to team-based 
performance identified through the structural and semantic mea- 
sures of AKOVIA. 

With regard to (a), past investigations have linked team-based
performance to individual knowledge and abilities and have found
that a collective lack of knowledge, skills, and resources leads to
ineffective performance (Kozlowski & Ilgen, 2006; Moreland &
Levine, 1992). Hence, numerous concepts have been postulated in 
order to test the composition of individual knowledge and charac- 
teristics on team performance (Chambers & Abrami, 1991; Marks 
et al., 2000). Among those, task-related knowledge of individual 
team members has been identified as a critical predictor for team- 
based performance (Horwitz, 2005), suggesting that diversity 
among team members as well as high knowledge levels have a 
positive effect on the team’s performance (Cox & Blake, 1991; 
Hambrick, Cho, & Chen, 1996). As part of assessing the validity 
of the structural and semantic measures of AKOVIA, we address
the question of whether teams composed of different levels of task-
related knowledge will show superior performance relative to
homogeneously composed teams. 


Hypothesis 1: It is hypothesized that the structural (Hypothesis
1a) and semantic (Hypothesis 1b) measures of AKOVIA
provide evidence for differences of team performance be- 
tween differently composed teams based on their task-related 
knowledge. 


With regard to (b), it is commonly accepted that team-related 
knowledge predicts team-based performance (Badke-Schaub et al., 
2011; Cannon-Bowers & Salas, 2001; Johnson & Lee, 2008). For 
example, Lim and Klein (2006) identified that team-related knowl- 
edge influenced team performance. More precisely, their findings 
suggest that task shared mental models (TaSMMs) and team 
shared mental models (TeSMMs) positively predicted team
performance. Similar findings have been reported by Mathieu et al.
(2000). Accordingly, the relation between SMMs and team-based
performance will be a further indicator of validity for the structural
and semantic measures of AKOVIA. 


Hypothesis 2: It is hypothesized that greater levels of 
TaSMMs (Hypothesis 2a) and TeSMMs (Hypothesis 2b) are 
associated with higher team-based performance assessed with 
the structural and semantic measures of AKOVIA. 


Method 


Participants 


Participants were freshman university students enrolled in an 
introductory psychology course at a German university. They 
participated for extra course credit. Five students who did not 
provide complete data were excluded from the sample. The final 
sample consisted of 224 students (68 men, 156 women; mean 
age = 21.6 years, SD = 2.48, min = 18, max = 33). They studied 
an average of 2.83 semesters (SD = 2.92). For 90% of the 
participants, it was their first enrollment in a university program, 
and 19% of the participants reported that they successfully finished 
a formal vocational training before the current university educa- 
tion. Of the participants, 79% rated their ability to work in a team 
as high; 81% rated their computer and social media skills as 
medium to high. 


Design 


A three-step computer-based algorithm was used to assign par- 
ticipants to teams and experimental conditions. The first step 
determined the participant’s total score of the introductory 
domain-specific knowledge test and the verbal abilities test (see 
the Measures section for detailed description of instruments). The 
second step determined the range of total scores for the three 
experimental conditions: (a) low knowledge team (LKT) com- 
posed of individual participants with low total scores (domain- 
specific knowledge and verbal abilities tests), (b) high knowledge 
team (HKT) composed of individual participants with high total 
scores, and (c) mixed knowledge team (MKT) composed of indi- 
vidual participants with low and high total scores (see Table 2; see 
also the Measures section for a detailed description of instru- 
ments). The third step randomly assigned participants to teams of 
four members, resulting in 56 teams (LKT, $n_1$ = 18 teams; HKT,
$n_2$ = 18 teams; MKT, $n_3$ = 20 teams). Tasks, materials, and
procedure were identical for all experimental conditions. 
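
The three-step assignment can be sketched as follows. The text does not report the score ranges that defined the conditions, so the sketch assumes a median split on the total score, mixed teams made of two low and two high scorers, and a fixed number of homogeneous teams per condition. `assign_teams` and its parameters are hypothetical names, not the algorithm used in the study.

```python
import random
from statistics import median


def assign_teams(test_scores, homogeneous_teams=18, team_size=4, seed=0):
    """Assign participants to LKT, HKT, and MKT teams.

    test_scores maps a participant id to a tuple
    (domain-specific knowledge score, verbal abilities score).
    """
    rng = random.Random(seed)

    # Step 1: total score per participant.
    totals = {pid: sum(scores) for pid, scores in test_scores.items()}

    # Step 2: score ranges for the conditions (assumed: median split).
    cut = median(totals.values())
    low = [pid for pid, total in totals.items() if total < cut]
    high = [pid for pid, total in totals.items() if total >= cut]

    # Step 3: random assignment to teams of team_size members.
    rng.shuffle(low)
    rng.shuffle(high)
    n_homogeneous = homogeneous_teams * team_size

    def chunk(people):
        return [people[i:i + team_size] for i in range(0, len(people), team_size)]

    lkt = chunk(low[:n_homogeneous])
    hkt = chunk(high[:n_homogeneous])

    # Remaining participants form mixed teams: half low, half high scorers.
    rest_low, rest_high = low[n_homogeneous:], high[n_homogeneous:]
    half = team_size // 2
    mkt = [rest_low[i:i + half] + rest_high[i:i + half]
           for i in range(0, min(len(rest_low), len(rest_high)), half)]
    return {"LKT": lkt, "HKT": hkt, "MKT": mkt}


if __name__ == "__main__":
    # Synthetic scores for 224 participants, for demonstration only.
    demo = {f"p{i}": (float(i), 0.5 * i) for i in range(224)}
    teams = assign_teams(demo)
    print({condition: len(groups) for condition, groups in teams.items()})
```

With 224 participants and distinct total scores, these assumptions happen to reproduce the reported 18, 18, and 20 teams, although the study may have arrived at that split differently.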


Online Environment and Team Tasks 


The online environment included all necessary information (tasks, 
reading materials, measures, feedback, contacts) for the individual 
participants and teams. Participants used a self-generated unique 
identifier to gain access to the online environment. After the first 
access to the online environments, participants were prompted to 
contact their team members. The study included two tasks (Tal, Ta2) 


Table 2 
Means and Standard Deviations of Domain-Specific Knowledge 
Test and the Verbal Abilities Test 

                                         LKT           HKT           MKT 
Variable                               M     SD      M     SD      M     SD 
Domain-specific knowledge pretest     2.86  1.04    5.10  1.35    3.79  1.27 
Domain-specific knowledge posttest    4.38  1.81    6.25  1.42    5.23  1.34 
Verbal abilities test                 [values illegible] 

Note. LKT = low knowledge team; HKT = high knowledge team; 
MKT = mixed knowledge team. 
to be solved by the teams in the form of a written essay (see Table 3 
for a detailed description of tasks and instructions). For each task, a 
domain expert generated a reference solution on the basis of the 
available learning materials in the form of a written essay. The tasks 
were generated automatically as a portable document format (PDF) 
document and were available for download any time participants 
accessed the online environment. The document included the partic- 
ipant’s unique identifier, team number, and team members as well as 
all task-related instructions including references to the learning ma- 
terials (see Table 3). An upload function in the online environment 
was available for handing in the task solution within a predefined time 
frame (72 hr after availability of the task). All team members had to 
upload the agreed-upon team solution so that their participation 
could be verified and so that they could be prompted to complete 
additional assessments. Accordingly, at specific points of the 
study, participants were asked to 
complete short questionnaires in the online environment (see the 
Procedure section for details about the sequence and frequency of the 
applied assessments). 


Measures 


Domain-specific knowledge. The domain-specific knowledge 
test included 11 multiple-choice questions with five possible solutions 
each (one correct, four incorrect). They were developed on the basis 
of introductory reading materials available to all students of the 
psychology course with a special focus on learning and assessment. A 
pilot study (N = 8 participants, independent from the participants of 
the main study) was used to test the average difficulty level to account 
for ceiling effects. Two identical versions (in which the 11 
multiple-choice questions appeared in a different order) of the domain-specific 
knowledge tests (pre- and posttest) were administered. For example, 
items administered included “What is the definition of classical con- 
ditioning?” or “What does the forgetting curve hypothesize?” It took 
about 6 min to complete the test. 

Verbal abilities. Participants’ verbal abilities were tested with a 
subscale of the I-S-T 2000R intelligence test (Intelligenz-Struktur- 
Test; Amthauer, Brocke, Liepmann, & Beauducel, 2001). A total of 
20 sentences with a missing word had to be completed using a set of 
five words. Overall, the widely used intelligence test has a high 
reliability (r = .88 to .96; split-half reliability). It took about 6 min to 
complete the test. 

Team Assessment Diagnostic Measure (TADM). The 
TADM questionnaire (Johnson, Lee, Lee, & O’Connor, 2007) iden- 
tifies two team-related factors: (a) the TaSMM (eight items) and (b) 
the TeSMM (seven items). The 15 items were answered on a 5-point 
Likert scale (1 = strongly disagree, 2 = disagree, 3 = not sure, 4 = 
agree, 5 = strongly agree). For example, an item related to the 
TeSMM reads: “My team communicates effectively with each other 
while performing our tasks.” Johnson et al. (2007) reported acceptable 
content validity and successful factorial structure of the instrument. 
Cronbach’s alpha ranges from .75 to .89. 


Procedure 


In the first phase of the study, the participants created a unique 
identifier when accessing the online environment and completed a 





Table 3 
Tasks Available in the Online Environment as Downloadable PDF Document (Translated From 
German) 
Measurement point    Task description and instruction 
Task 1 Please work with your team members to respond to the following question in the form of 
a written text (one page), which shall be published in a teacher magazine: How did the 
concept of learning change during the 20th century? 

[Link to key reference material] 

Contact your three team members (see contact details below) and organize at least one 
synchronous virtual meeting (e.g., via Skype) in which you discuss and solve your 
task. Each of your team members will need to upload the solution in the online 
environment within the next 72 hours. After uploading the solution, you will be 
prompted to answer two short questionnaires. 

[Personal identifier] 

[Team number] 

[Names and contact information of team members] 

[Link to online environment] 

[Contact information of examiner] 

Task 2 Please work with your team members to respond to the following question in the form of 
a written text (one page), which shall be published in a teacher magazine: Which 
functions do performance assessments in schools have? 


[Link to key reference material] 


Contact your three team members (see contact details below) and organize at least one 
synchronous virtual meeting (e.g., via Skype) in which you discuss and solve your 
task. Each of your team members will need to upload the solution in the online 
environment within the next 72 hours. After uploading the solution, you will be 
prompted to answer two short questionnaires. 


[Personal identifier] 
[Team number] 


[Names and contact information of team members] 


[Link to online environment] 


[Contact information of examiner] 
demographic data questionnaire as well as the introductory 
domain-specific knowledge test and verbal abilities test. As 
outlined above (see the Design section), the total scores of the 
introductory tests were automatically generated, and teams (four 
members each) were randomly composed on the basis of the results of 
the tests and the definition of experimental conditions (LKT, HKT, 
MKT). After 72 hr, participants accessed the online environment 
and were prompted to complete the first task (Ta1), which included 
detailed instructions and the contact information of the team 
members (all as a downloadable PDF document; see Table 3). The 
teams were asked to meet at least once for a synchronous virtual 
interaction (e.g., via Skype) within the next 72 hr. Each team 
member completed the TADM questionnaire before uploading the team 
solution within the allocated 72 hr. In the third phase of the 
study, participants accessed the online environment and were 
prompted to complete the second task (Ta2; a downloadable PDF 
document including instructions and contact information of the 
team members; see Table 3). After solving the second task in the 
team, including at least one synchronous virtual interaction (e.g., 
via Skype) within 72 hr, each team member completed the TADM 
questionnaire before uploading the team solution within the 
allocated 72 hr. Finally, students completed the posttest version 
of the domain-specific knowledge test. 


Variables and Data Analyses 


The quality of the team-based performance (written essays) 
was analyzed with AKOVIA by comparing the essays against the 
reference solution, which was based on an expert solution of the 
individual task (Ta1, Ta2) and the available learning materials. 
This study uses two AKOVIA measures: Gamma Matching 
(GAM) and Balanced Semantic Matching (BSM). BSM, as the 
quotient of all unique semantic AKOVIA measures, has proved 
preferable for semantic comparisons of written texts as it in- 
cludes the semantic information of single concepts and more 
complex semantic information of propositions (Johnson et al., 
2011; Pirnay-Dummer & Ifenthaler, 2011). Other studies suc- 
cessfully used GAM for identifying the complexity within 
several subject domains as it reflects the structural connected- 
ness of knowledge externalizations (Ifenthaler et al., 2011; 
McKeown, 2009). Reliability coefficients of the administered 
instruments are acceptable and consistent with previously 
reported results: verbal abilities (split-half reliability = .982), 
TaSMM_1 (Cronbach’s α = .840), TaSMM_2 (Cronbach’s α = .876), 
TeSMM_1 (Cronbach’s α = .835), TeSMM_2 (Cronbach’s α = .851). 
Table 4 shows all variables and scoring specifications 
of the study. 
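
For readers who wish to recompute the internal consistency of the TaSMM and TeSMM scales from item-level ratings, the standard formula for Cronbach's alpha can be sketched as follows. The function name `cronbach_alpha` and the small example matrix are hypothetical; the study's item-level data are not reproduced here.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a respondents-by-items matrix.

    ratings is a list of respondents, each a list of k item scores
    (e.g., 5-point Likert ratings). Uses sample variances (n - 1).
    """
    k = len(ratings[0])
    # Variance of each item across respondents.
    item_vars = [variance([row[i] for row in ratings]) for i in range(k)]
    # Variance of the respondents' sum scores.
    total_var = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical ratings of four respondents on two items.
    example = [[1, 2], [2, 1], [3, 4], [4, 3]]
    print(round(cronbach_alpha(example), 3))
```

Alpha approaches 1 when the items rise and fall together across respondents, which is why it is read as an index of internal consistency.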

We conducted an analysis of variance (ANOVA) to analyze 
between-group differences by experimental groups (Hypothesis 1). 
In order to control for Type I error, Tukey’s honestly significant 
difference (HSD) post hoc comparisons were used to examine 
differences between experimental groups. As a second major 
analytic strategy, regression models were performed, one for each of 
the four team-based performance measures as the dependent 
variables (Hypothesis 2): GAM_t (structural measure) and BSM_t 
(semantic measure), for tasks t = 1, 2. 


Results 


Descriptive Analyses 


The average text length of the responses for Ta1 was M = 
379.79 words (SD = 85.43; min = 238; max = 546) and for Ta2 
M = 385.55 words (SD = 82.27; min = 275; max = 546). 
Accordingly, the required text length for a valid AKOVIA analysis 
(approximately 300 words) was met on average. No significant 
differences of text length were found between the experimental groups (LKT, 
HKT, MKT) for Ta1, F(2, 221) = 0.375, ns, and Ta2, F(2, 221) = 
0.137, ns, as well as between the two tasks, t(223) = 1.254, ns. 

On the domain-specific knowledge test (pre- and posttest), par- 
ticipants could score a maximum of 11 correct answers. In the 
pretest (DKpre), they scored an average of M = 3.91 correct 
answers (SD = 1.52; min = 0; max = 8), and in the posttest 
(DKpost) they scored an average of M = 5.28 correct answers 
(SD = 1.70; min = 0; max = 10). The increase in correct answers 
was significant, t(223) = 13.863, p < .001, d = 1.857. 
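
The pre-post comparison corresponds to a paired-samples t test on the individual gain scores (DKgain = DKpost - DKpre). The following sketch shows the computation of the test statistic and its degrees of freedom; the function name `paired_t` and the example scores are hypothetical. The reported t(223) implies 224 paired observations.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(pre, post):
    """Paired-samples t statistic and degrees of freedom.

    pre and post are equally long lists of scores from the same
    participants (e.g., DKpre and DKpost).
    """
    gains = [b - a for a, b in zip(pre, post)]  # individual DKgain values
    n = len(gains)
    standard_error = stdev(gains) / sqrt(n)
    return mean(gains) / standard_error, n - 1


if __name__ == "__main__":
    # Hypothetical pretest and posttest scores of four participants.
    t_value, df = paired_t([1, 2, 3, 4], [2, 4, 5, 5])
    print(f"t({df}) = {t_value:.3f}")
```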


Hypothesis 1: Differently Composed Teams 


Table 5 summarizes the means and standard deviations for the 
structural (GAM) and semantic (BSM) team-based performance 
for each task (Ta1, Ta2) and the three experimental groups (LKT, 
HKT, MKT). 

Ta1. With regard to Hypothesis 1a (structural AKOVIA 
measure indicating differences in team performance), an ANOVA 
revealed significant differences between the three experimental 
groups for the structural team-based performance measure 
(GAM_1), F(2, 221) = 48.037, p < .001, η² = .303. Tukey’s HSD 


Table 4 
Variables of the Study, Corresponding Instrument, and Description of Scoring 

Variable [abbreviation]                          Instrument             Scoring 
Gamma matching [GAM_t]                           AKOVIA                 AKOVIA structural similarity measure; 0 ≤ GAM_t ≤ 1; for task t = 1, 2 
Balanced semantic matching [BSM_t]               AKOVIA                 AKOVIA semantic similarity measure; 0 ≤ BSM_t ≤ 1; for task t = 1, 2 
Domain-specific knowledge (pretest) [DKpre]      Multiple-choice test   Sum of correct answers; 0 ≤ DKpre ≤ 11 
Domain-specific knowledge (posttest) [DKpost]    Multiple-choice test   Sum of correct answers; 0 ≤ DKpost ≤ 11 
Domain-specific knowledge gain [DKgain]          Multiple-choice test   DKgain = DKpost - DKpre; -11 ≤ DKgain ≤ 11 
Verbal abilities [VA]                            I-S-T 2000R            Sum of correct answers; 0 ≤ VA ≤ 11 
Task shared mental model [TaSMM_t]               TADM                   Mean rating of scale items; 1 ≤ TaSMM_t ≤ 5; for task t = 1, 2 
Team shared mental model [TeSMM_t]               TADM                   Mean rating of scale items; 1 ≤ TeSMM_t ≤ 5; for task t = 1, 2 

Note. AKOVIA = Automated Knowledge Visualization and Assessment; I-S-T 2000R = Intelligenz-Struktur-Test; TADM = Team Assessment Diagnostic Measure. 
Table 5 
Means and Standard Deviations of Team-Based Performance 

               LKT             HKT             MKT 
Task         M     SD        M     SD        M     SD 
GAM 
  Ta1      .646  .145      .802  .147      .843  .091 
  Ta2      .674  .181      .684  [illegible]  .647  .187 
BSM 
  Ta1      .502  .254      .622  .129      .748  .036 
  Ta2      .605  .176      .764  .133      .826  .094 

Note. LKT = low knowledge team; HKT = high knowledge team; 
MKT = mixed knowledge team; Ta1 = Task 1; Ta2 = Task 2; GAM = 
gamma matching; BSM = balanced semantic matching. 


post hoc comparisons indicate that the HKTs (95% CI [.767, .836], 
p < .001) and MKTs (95% CI [.823, .863], p < .001) gained 
significantly higher structural similarity than the LKTs (95% CI 
[.612, .680]). However, comparisons between the HKTs and 
MKTs were not statistically significant at p < .05. 

A second ANOVA was computed in order to test Hypothesis 1b 
(semantic AKOVIA measure indicating differences in team 
performance). It revealed a significant difference between 
the three experimental groups for the semantic team-based 
performance measure (BSM_1), F(2, 221) = 43.351, p < .001, η² = .282. 
Tukey’s HSD post hoc comparisons indicate that the HKTs (95% 
CI [.592, .652], p < .001) and MKTs (95% CI [.740, .756], p < 
.001) gained significantly higher semantic similarity than the 
LKTs (95% CI [.442, .562]). The Tukey’s HSD comparison 
between the HKTs and MKTs was statistically significant at p < 
.001. 

Ta2. With regard to Hypothesis 1a (structural AKOVIA 
measure indicating differences in team performance), no significant 
between-group differences could be identified across the three 
experimental groups for the structural team-based performance 
measure (GAM_2), F(2, 221) = 0.831, ns. 

An ANOVA was computed in order to test Hypothesis 1b 
(semantic AKOVIA measure indicating differences in team 
performance). A significant difference was found between the three 
experimental groups for the semantic team-based performance 
(BSM_2), F(2, 221) = 51.559, p < .001, η² = .318. Tukey’s HSD 
post hoc comparisons indicate that the HKTs (95% CI [.732, .795], 
p < .001) and MKTs (95% CI [.805, .847], p < .001) gained 
significantly higher semantic similarity than the LKTs (95% CI 
[.563, .646]). The Tukey’s HSD comparison between the HKTs 
and MKTs was statistically significant at p = .016. 

To sum up, the expected differences of team-based performance 
between differently composed teams, based on their task-related 
knowledge, were found on both the structural (GAM) and semantic 
(BSM) measures of AKOVIA. 
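
The between-group tests reported above follow the standard one-way ANOVA decomposition into between-group and within-group sums of squares. The sketch below shows this computation together with the effect size η²; the function names and the toy data are hypothetical. As a consistency check, η² can also be recovered from a reported F value and its degrees of freedom, and F(2, 221) = 48.037 yields the reported η² = .303.

```python
from statistics import mean


def one_way_anova(groups):
    """Return F, df_between, df_within, and eta squared for a list of groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_value, df_between, df_within, eta_squared


def eta_squared_from_f(f_value, df_between, df_within):
    """Recover eta squared from a reported F statistic."""
    return f_value * df_between / (f_value * df_between + df_within)


if __name__ == "__main__":
    # Hypothetical similarity scores for three small groups.
    toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
    print(one_way_anova(toy_groups))
    # Check against the values reported for GAM in Ta1.
    print(round(eta_squared_from_f(48.037, 2, 221), 3))
```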


Hypothesis 2: Influence of SMMs 


For each outcome variable (structural team performance, GAM_t; 
semantic team performance, BSM_t) and task (Ta1, Ta2), we 
conducted a regression analysis accounting for prediction by TaSMM_t 
and TeSMM_t. 


Ta1. Table 6 shows the zero-order correlations of predictors 
used in the regression analyses for Ta1, indicating significant 
correlations between structural/semantic team performance and 
Ta/TeSMM. 

The results of the regression analyses for structural (GAM_1) and 
semantic (BSM_1) team-based performance are presented in Table 
7, yielding a ΔR² of .163 and .244 for GAM_1 and BSM_1, 
respectively. For GAM_1, the SMM contributed unique variance to the 
structural team-based performance. Specifically, TaSMM_1 and 
TeSMM_1 positively predicted the structural team-based 
performance (GAM_1), indicating that the higher the sharedness of task 
and team knowledge, the higher the structural team-based 
performance (see Table 7). For BSM_1, the SMM contributed unique 
variance to the semantic team performance. Specifically, TaSMM_1 
and TeSMM_1 positively predicted the semantic team-based 
performance (BSM_1), indicating that the higher the sharedness of task 
and team knowledge, the higher the semantic team-based 
performance (see Table 7). 

Ta2. Table 8 shows the zero-order correlations of predictors 
used in the regression analyses for Ta2, indicating significant 
correlations between semantic team performance and TeSMM. 

The regression analysis for structural team-based performance 
(GAM_2) did not explain a significant amount of variance (ΔR² = 
.001), F(2, 221) = 1.114, ns. For the semantic team-based 
performance (BSM_2), the regression model explained ΔR² = .114 of 
variance (see Table 7). Specifically, TaSMM_2 positively predicted 
the semantic team-based performance (BSM_2), indicating that the 
higher the sharedness of task knowledge, the higher the semantic 
team-based performance (see Table 7). 

To sum up, for Ta1, the findings suggest that higher structural 
and semantic team-based performance were predicted by greater 
levels of sharedness of task and team knowledge. For Ta2, the 
findings suggest that higher semantic team-based performance was 
predicted by greater levels of sharedness of task knowledge. 
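
For a regression with two predictors, the standardized coefficients and R² can be derived directly from the zero-order correlations, which offers a simple way to cross-check the reported models. The sketch below uses the standard two-predictor formulas; the function names are hypothetical, and the sample size of 224 is inferred from the 56 teams of four members. Applied to the Ta2 correlations for semantic team performance (.341 with TaSMM, .070 with TeSMM, and .406 between the two predictors), it gives a β of about .374 for TaSMM and an adjusted R² of about .114.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized betas and R squared for a two-predictor regression.

    r_y1, r_y2: correlations of the outcome with predictors 1 and 2.
    r_12: correlation between the two predictors.
    """
    denominator = 1 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denominator
    beta_2 = (r_y2 - r_y1 * r_12) / denominator
    r_squared = beta_1 * r_y1 + beta_2 * r_y2
    return beta_1, beta_2, r_squared


def adjusted_r_squared(r_squared, n, k):
    """Adjust R squared for n observations and k predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


if __name__ == "__main__":
    # Ta2, semantic team performance (BSM) as the outcome.
    beta_ta, beta_te, r2 = standardized_betas(0.341, 0.070, 0.406)
    print(round(beta_ta, 3), round(beta_te, 3), round(r2, 3))
    # n = 224 is an inference from 56 teams of four members.
    print(round(adjusted_r_squared(r2, n=224, k=2), 3))
```

The negative sign of the second coefficient illustrates how a predictor with a small positive zero-order correlation can receive a negative weight once a correlated predictor is controlled for.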


Discussion 


Cooke et al. (2000) reviewed the strengths and weaknesses of 
methods for assessing team knowledge, including observations, 
interviews, questionnaires, process tracing, and knowledge map- 
ping. None of these methods, however, used automated algorithms 
for natural language-oriented assessment of team performance. 
Hence, in this study, the feasibility and validity of a language- 
oriented approach toward automated computer-based assessment 
and visualization of team-based performance were investigated, 


Table 6 
Descriptives and Zero-Order Correlations of Predictor 
Variables for Task 1 

Variable                             1            2            3        4 
1. Task shared mental model          — 
2. Team shared mental model       [illegible]     — 
3. Structural team performance    [illegible]  [illegible]     — 
4. Semantic team performance       .454**       .345**       .390**     — 
M                                  3.87         3.95       [illegible]  .63 
SD                                  .67          .56       [illegible]  .19 

** p < .01. 
Table 7 
Regression Analysis Predicting Structural (GAM_t) and Semantic (BSM_t) Team Performance for Task 1 and Task 2 

Task 1 
                                              GAM_1                                          BSM_1 
Variable                      R²          Adjusted R²  B      SE B   β             R²     Adjusted R²  B      SE B   β 
Shared mental model           [illegible]  .163                                    .250   .244 
  Task shared mental model                             .076   .015   .305                              .110   .018   .383 
  Team shared mental model                             .046   .018   [illegible]                       .076   .021   [illegible] 

Task 2 
                                              GAM_2                                          BSM_2 
Variable                      R²          Adjusted R²  B      SE B   β             R²     Adjusted R²  B      SE B   β 
Shared mental model           .010         .001                                    .122   .114 
  Task shared mental model                        [illegible] .025   -.038                             .116   .021   .374 
  Team shared mental model                             .033   .022   [illegible]                      -.022   .019   -.082 

** p < .01. *** p < .001. 


which can be applied in educational large-scale assessments (e.g., 
PISA, PIAAC). 

The computer-based AKOVIA methodology integrates a multi- 
stage language-oriented algorithm that transforms text into a list 
form and a proximity matrix by assigning distances and weights to 
single words and identifying associative evidence in sentences and 
the text (Pirnay-Dummer & Ifenthaler, 2010). The resulting list 
form and proximity matrix enable an in-depth analysis of 
structural and semantic features of the text. Currently, AKOVIA 
supports four structural, three semantic, and additional graph 
theory-based measures (see Table 1). The structural measures identify 
surface features of the knowledge representation such as the sum 
of concepts (SFM; Surface Matching), the complexity of concepts 
(GRM; Graphical Matching), and the connectedness of concepts 
(GAM). Additionally, deep structural features can be analyzed by 
deconstructing the knowledge representation into the smallest pos- 
sible units (STM; Structural Matching). The semantic measures 
operate on the stemmed words of the knowledge representation 
identifying predefined semantic features of single words (CCM; 
Concept Matching) or propositions (PPM; Propositional Match- 
ing) defined as a word linked to another word (Ifenthaler, 2010b). 
The quotient of PPM and CCM results in the BSM. Graph theory- 
based measures provide evidence about the connectedness of the 
knowledge representation (i.e., all concepts are linked to reach 
every concept from every other concept). Ruggedness indicates the 


Table 8 
Descriptives and Zero-Order Correlations of Predictor 
Variables for Task 2 

Variable                             1            2        3        4 
1. Task shared mental model          — 
2. Team shared mental model        .406**         — 
3. Structural team performance     .006         .094       — 
4. Semantic team performance       .341**       .070     .060       — 
M                                  4.04         3.93     .67    [illegible] 
SD                              [illegible]      .60     .18    [illegible] 

** p < .01. 

sum of subrepresentations, that is, independent concepts or prop- 
ositions not linked to other parts of the knowledge representation 
(Ifenthaler et al., 2011; Schvaneveldt, 1990). The standardized 
graphical output is constructed from the N strongest relations 
within the whole proximity matrix of the knowledge representation 
using GraphViz (Ellson et al., 2003). N can be set within the 
AKOVIA analysis functions in order to accommodate specific 
assessment and analysis requirements such as limited word length 
of available texts. 
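
The idea of deriving a proximity structure from text and then keeping only the N strongest relations can be illustrated with a deliberately simplified sketch. It is not the AKOVIA algorithm: the sentence-level co-occurrence count used here is a stand-in for AKOVIA's distances and weights, no stemming or stop-word handling is performed, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(sentences):
    """Count how often two words appear in the same sentence.

    A simplified stand-in for a proximity matrix: keys are
    alphabetically ordered word pairs, values are co-occurrence counts.
    """
    weights = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for pair in combinations(words, 2):
            weights[pair] += 1
    return weights


def strongest_relations(weights, n):
    """Return the n highest-weighted pairs (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


if __name__ == "__main__":
    text = [
        "learning changes behavior",
        "learning shapes behavior",
        "assessment informs learning",
    ]
    for (word_a, word_b), weight in strongest_relations(cooccurrence_weights(text), 3):
        print(word_a, "--", word_b, weight)
```

Lowering n yields a sparser graph, which is the trade-off the text describes for short responses.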

As suggested by previous empirical research (Horwitz & Hor- 
witz, 2007), our findings revealed significant differences of struc- 
tural and semantic team-based performance between differently 
composed teams, that is, low task-related knowledge, high task- 
related knowledge, and mixed task-related knowledge. More spe- 
cifically, the results of this study suggest that AKOVIA’s BSM is 
an acceptable measure for language-oriented assessment of team- 
based performance. BSM balances the dependency of semantically 
correct concepts (vertices) and causal relations (i.e., propositions) 
of semantically correct concepts as well as includes structural 
features (Ifenthaler, 2010c). Whereas CCM only identifies the 
semantically correct use of single concepts and PPM only identi- 
fies the use of semantically correct propositions, BSM accounts for 
all of these features. Additionally, the expected differences of 
team-based performance between differently composed teams, 
based on their task-related knowledge, were found for the struc- 
tural (GAM) measure of AKOVIA. 
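
How BSM relates to CCM and PPM can be made concrete with a small sketch. The text defines BSM as the quotient of PPM and CCM; everything else below is an assumption made for illustration: Jaccard overlap stands in for AKOVIA's similarity function, propositions are treated as undirected concept pairs, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Set overlap in [0, 1]; a stand-in for AKOVIA's similarity function."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def semantic_measures(team_props, ref_props):
    """Return (CCM, PPM, BSM) for a team solution and a reference solution.

    Each argument is an iterable of (concept, concept) propositions.
    """
    # Undirected propositions: ("a", "b") equals ("b", "a").
    team = {frozenset(p) for p in team_props}
    ref = {frozenset(p) for p in ref_props}
    # Concepts are all words that occur in at least one proposition.
    team_concepts = set().union(*team)
    ref_concepts = set().union(*ref)
    ccm = jaccard(team_concepts, ref_concepts)  # overlap of single concepts
    ppm = jaccard(team, ref)                    # overlap of linked pairs
    bsm = ppm / ccm if ccm else 0.0             # quotient of PPM and CCM
    return ccm, ppm, bsm


if __name__ == "__main__":
    team_solution = [("learning", "behavior"), ("learning", "memory")]
    reference = [("behavior", "learning"), ("learning", "assessment")]
    print(semantic_measures(team_solution, reference))
```

Because a proposition can only match when both of its concepts match, the quotient indicates how much of the shared vocabulary is also linked in the same way as in the reference solution.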

Further, results suggest that higher structural and semantic team- 
based performance measured with AKOVIA were predicted by 
greater levels of sharedness of task and team knowledge. The 
identified relations between SMMs and team-based performance 
are consistent with previous research (Badke-Schaub et al., 2011; 
Cannon-Bowers & Salas, 2001). However, the results for the two 
different tasks were not fully consistent, as TeSMM was not a 
significant predictor for GAM and BSM in Ta2. This could be attributed 
to the TeSMM factor of the TADM questionnaire, which was not 
tested and implemented with repeated measures designs in previ- 
ous studies. Accordingly, further empirical investigations with 
TADM should focus on the consistency of the instrument over 
time. A further data analysis with additional structural AKOVIA 
measures (SFM, GRM, STM) confirmed the result of the GAM 
measure. No correlations between the TeSMM factor and structural 
AKOVIA measures were found. 

Hence, the findings of this study provide initial but resilient 
evidence of the feasibility and validity of the AKOVIA method- 
ology for an automated and language-oriented computer-based 
assessment of team performance. As the analysis algorithm is 
scalable and adaptive to educational settings and assessments, 
application within international large-scale assessments should be 
explored in the future. Still, limitations of this study need to be 
addressed, and further empirical research is required to replicate 
and advance the findings of this study. 


Limitations and Future Research 


As with all experimental research, there are obvious limitations 
to this study that require consideration, primarily referring to 
sample characteristics and methodological issues. 

First, the conceptual model focusing on the interaction of IMMs 
and SMMs and their influence on team processes as well as team 
performance has been informed by the current state of research on 
teams (Cannon-Bowers & Salas, 2001; Van den Bossche et al., 
2011). However, the conceptual model includes a strong cognitive 
perspective and therefore does not emphasize affective, social, 
communication, and process-oriented components (Bartelt, Den- 
nis, Yuan, & Barlow, 2013; Wildman et al., 2012). Hence, future 
work should advance the conceptual model and test the underlying 
theoretical assumption within controlled experimental settings. 

Second, although our sample was large enough to achieve 
statistically significant results, the explained variance for our re- 
gression models was rather moderate. This indicates that besides 
the tested variables, others may have influenced the outcomes that 
were not tested in this study. 

Third, the sample included a select group of participants from 
one university all enrolled in a specific course, thus prohibiting 
generalizations of results. Further, all participants of this study 
were inexperienced within the subject domains. This fact clearly 
limits the external validity of our findings (Campbell & Stanley, 
1963). Accordingly, future studies should include various levels of 
difficulty, task type, and dispositions of participants within and 
across different subject domains. 

Fourth, although the participants’ prior domain-specific 
knowledge was assessed, their causal, procedural, and metacognitive 
knowledge was not explicitly tested. Hence, possible effects of 
these variables were not addressed in this empirical investigation. 
Future studies may include a more comprehensive assessment of 
knowledge dimensions of individuals and teams (Wildman et al., 
2012). 

Fifth, the language-oriented analysis algorithm requires approx- 
imately 300 words for producing valid results (Pirnay-Dummer & 
Ifenthaler, 2011). Within this study, some teams produced texts 
under 300 words. A qualitative comparison of texts with low word 
counts and texts with higher word counts, however, did not 
reveal significant differences. Still, further studies are required to 
define the minimum word limit for valid AKOVIA analysis. 

Sixth, the present study focused on the assessment of written 
text in German. Because AKOVIA currently supports natural language 
processing for German and English only, the lack of support for 
other languages is a clear limitation. 
Hence, a further development of AKOVIA should integrate several 
other language packages, for instance, those required in interna- 
tional large-scale studies. This would open up opportunities for 
further research focusing on the automated assessment of written 
texts in different languages. Such research would provide in-depth 
understanding of the validity of identical assessments in different 
languages and cultural contexts. 

Seventh, individual team members were not assessed with re- 
gard to their subjective solution of the task. Accordingly, a future 
study may include an experimental variation where participants are 
asked to individually respond to the task. These individual solu- 
tions may then be aggregated for further analysis using the 
AKOVIA methodology (Ifenthaler & Pirnay-Dummer, 2014). A 
resulting research question may investigate the difference between 
aggregated solutions (based on individual responses) and team- 
created solutions. 

Finally, a precise investigation of the learning-dependent pro- 
gression of SMMs and their influence on team-based performance 
was not realized in this study. Also, information about the task and 
team shared knowledge was collected through self-report measures 
that have clear limitations with regard to their reliability and 
validity (Miyake, 1986). As research on the progression of SMMs 
is scarce (Barron, 2000), future studies may investigate the pro- 
gression of IMMs and their influence on the progression of SMMs, 
and vice versa. Additionally, alternative assessment approaches for 
TaSMMs and TeSMMs may be investigated in order to better 
understand how these complex processes can be better assessed 
and optimized through interventions (Janssen, Erkens, Kirschner, 
& Kanselaar, 2010). 


Implications and Conclusion 


The demand for computer-based assessment methodologies is 
evident through the current focus of international large-scale as- 
sessments such as PISA and PIAAC, as they enforce the imple- 
mentation of alternative assessment environments. Clearly, there 
are numerous approaches for eliciting individual and team knowl- 
edge for various learning, teaching, and assessment purposes. For 
example, the recent introduction of mobile devices (e.g., tablets, 
PC) in K-12 education opens up new potentials for mobile learn- 
ing environments, including augmented reality for learning, teach- 
ing, and assessment purposes (Cheng & Tsai, 2013; Ifenthaler & 
Eseryel, 2013). Another example is the rapid growth of online 
work environments in which teams gather, learn, and work to- 
gether in virtual spaces, never meeting each other physically (Beer, 
Slack, & Armitt, 2005). Automated assessment and feedback sys- 
tems for virtual environments may facilitate the shared understand- 
ing of tasks and project goals as well as provide insights into the 
team’s progression toward common goals and successful solutions 
(Lenz & Machado, 2008). 

However, most automated assessment approaches have not been 
tested for reliability and validity and are applicable almost 
exclusively to single or small sets of data and specific subject 
domains (Fischer & Mandl, 2005; Greiff, Wüstenberg, Molnár, et al., 
2013; Ruiz-Primo, Schultz, Li, & Shavelson, 2001). Therefore, new 
approaches are required that have not only been tested for reliability 
and validity but also provide a fast and economical way of analyzing 
larger sets of data (Greiff, Wüstenberg, Holt, Goldhammer, & 








[Figure 3 diagram. Recoverable labels: Empirical research; Deeper understanding; Need for more research; Need for new development/innovation; Existing tool; Large-scale implementation research; Validation and cross-validation research.] 

Figure 3. Research and methodology development framework. 


Funke, 2013). Additionally, approaches for educational assessment 
also need to move beyond the perspective of correct and incorrect 
solutions (Mislevy et al., 2010). Hence, as we move further into 
the 21st century, the application of alternative assessment strate- 
gies is inevitable for current educational assessment (Savenye, 
2014). 

On the basis of the recent experience of developing, implementing, 
and empirically testing the AKOVIA methodology, a framework for 
integrating research and methodology development is outlined in 
Figure 3. This framework may both inform the innovation process 
(validation research) and improve the research without increasing the 
risk to the research results. Until new methodologies such as AKOVIA 
are finally accepted, the standard tools remain in use. 
The triangulation will give interesting additional insight into research 
problems by means of post hoc analyses. Accordingly, the integration 
of methodological innovation alongside research standards and com- 
mon research will shorten the time for implementation without harm- 
ing the research process itself (Ifenthaler & Pirnay-Dummer, 2014). 
Without automation, many research projects would not be possible 
from the start. Hence, automation helps in raising the objectivity of 
outcomes and helps in realizing educational large-scale assessments. 
It also allows for a whole set of small and medium research projects 
to gain access to a reliable means of computer-based assessment 
(Csapó et al., 2012). 

Given the recent developments in educational data mining and 
learning analytics (Long & Siemens, 2011), the automated assessment 
and analysis function of AKOVIA could be used to inform both 
decision makers (e.g., teachers, tutors, learning designers) and the 
learners during an ongoing learning process. Outcomes and results of 
these assessments could then be aggregated, transformed, and thus 
used to create feedback panels, dashboards, or even written feedback, 
based on the current learner and assessment model (Greller & 
Drachsler, 2012; Macfadyen & Dawson, 2012). Feedback on ongoing 
learning could be explicit by using results of the automated assess- 
ment and analysis (e.g., graphs, change indicators, as well as conver- 
gence toward a reference solution). Feedback could also be trans- 
formed for a more implicit use of the aggregations by implementing 
algorithms to create informative language-based feedback using quan- 
titative measures of AKOVIA and linking them with a phrase data- 
base or badges. 


Educational large-scale triangulation studies will continuously help 
to improve and converge the innovations in the methodology of 
assessment. To triangulate the use, usability, and feasibility of auto- 
mated assessment methodologies, a large data set of a national and 
international magnitude may be preferable. This process may be 
repeated for validation and comparability should new methodologies 
be available or if existing methodologies change significantly. Still, 
challenges and critical issues in automated computer-based assess- 
ment, such as data security, accessibility, comparability, and compli- 
ance, need to be addressed before these systems are fully implemented 
in educational and work-related settings. 

The scalability of the automated AKOVIA algorithm provides 
numerous applications for computer-based assessment environ- 
ments within international large-scale assessments, such as collab- 
orative problem solving, or teamwork. AKOVIA has the potential 
to widen the scope of current large-scale assessments and guide a 
new generation of cross-curricular natural language-oriented as- 
sessment environments. 
The 21st-century work environment places strong emphasis on nonroutine transversal skills. In an 
educational context, complex problem solving (CPS) is generally considered an important transversal 
skill that includes knowledge acquisition and its application in new and interactive situations. The 
dynamic and interactive nature of CPS requires a computer-based administration of CPS tests such that 
the assessment of CPS might be partially confounded with information and communication technology 
(ICT) literacy. To establish CPS as a distinct construct that involves complex cognitive processes not 
covered by other general cognitive abilities and not related to ICT literacy, it is necessary to investigate 
the influence of ICT literacy on CPS and on the power of CPS to predict external educational criteria. 
We did so in 3 different samples of either high school or university students using a variety of instruments 
to measure ICT literacy and general cognitive ability. Convergent results based on structural equation 
modeling and confirmatory factor analyses across the studies showed that ICT literacy was weakly or 
moderately related to CPS, and these associations were similar to those between ICT and other general 
cognitive abilities. Furthermore, the power of CPS to predict external educational criteria over and above 
general cognitive ability remained even if the influence of ICT literacy on CPS was controlled for. We 
conclude that CPS is a distinct construct that captures complex cognitive processes not generally found 
in other assessments of general cognitive ability or of ICT literacy. 

Occupational demands are changing rapidly in the 21st cen- 
tury (Autor & Dorn, 2009; Organisation for Economic Co- 
Operation and Development [OECD], 2010; Spitz-Oener, 
2006), and nonroutine skills are becoming more and more 
important, whereas the importance of routine skills that are 
characterized by the repetitive occurrence of similar situations 
is decreasing (Autor, Levy, & Murnane, 2003). As nonroutine 
skills can be used in many situations and for different problems, 
these skills are by definition transversal rather than domain 
specific. Furthermore, facilitating transversal competencies is a 
central objective in a number of educational programs (Mayer 
& Wittrock, 2006), and transversal skills such as complex 
problem solving (CPS) play an important role in everyday life 
(Funke, 2010). Although transversal skills are found in a num- 
ber of areas and encompass skills such as metacognition, cre- 
ativity, as well as collaborative and CPS, this last skill is a 
particularly important and promising transversal skill that has 
recently received a lot of attention, especially in educational 
contexts. For instance, Mayer and Wittrock (2006) stated that 
one of education’s greatest challenges is making students good 
problem solvers. Therefore, it is not surprising that CPS is now 
an integral part of international educational large-scale assess- 
ments such as the Programme for International Student Assess- 
ment (PISA), arguably the most influential educational large- 
scale survey worldwide (OECD, 2009a). According to Buchner 
(Frensch & Funke, 1995), CPS can be defined as: 

The successful interaction with task environments that are dynamic 
(i.e., change as a function of user’s intervention and/or as a function 
of time) and in which some, if not all, of the environment’s regular- 
ities can only be revealed by successful exploration and integration of 
the information gained in that process. (p. 14) 


CPS has some unique characteristics that distinguish it from 
other abilities such as reasoning or working memory (cf. Dörner, 
Kreuzig, Reither, & Stäudel, 1983; Fischer, Greiff, & Funke, 2012; 
Funke, 2001). It refers to complex and nontransparent situations 
because not all of the necessary information to solve the problem 
is available until the problem solver interacts with the problem 
dynamically. That is, some information is hidden at the outset 
(Frensch & Funke, 1995). Furthermore, dynamic changes and 
highly interrelated elements in CPS require problem solvers to 
actively generate information by applying adequate strategies. 
Finally, multiple goals have to be taken into account when trying 
to solve a problem. Thus, a dynamic interaction between the 
problem solver and the task situation is an inherent feature of CPS 
(Wirth & Klieme, 2003), and this kind of interaction is not con- 
ceptually inherent to other cognitive abilities. These characteristics 
of CPS are typical of transversal skills, and this is why CPS has a 
prominent relevance among them. 

In general, CPS is composed of two overarching processes: the 
active acquisition of knowledge about a problem situation (knowl- 
edge acquisition; Mayer & Wittrock, 2006) and the active use of 
this knowledge, that is, finding a solution to a problem (knowledge 
application; Novick & Bassok, 2005). According to the definition 
of CPS and the aforementioned characteristics, especially those of 
interactivity and dynamics, the assessment of CPS should be 
particularly fruitful in the context of computer-based assessment 
(CBA). CBA provides a unique assessment environment for the 
required dynamic and interactive situations that cannot be pro- 
vided by the use of paper-and-pencil instruments (Kyllonen, 2009; 
Williamson, Mislevy, & Bejar, 2006). 

In the area of assessment, there has always been high interest in 
CPS as a higher order thinking skill (Kuhn, 2009) that may both 
conceptually and empirically complement assessments of other 
general cognitive abilities such as reasoning, working memory, 
perceptual speed, and so forth. In fact, recent findings have sug- 
gested that CPS has an added value beyond other general cognitive 
abilities. For example, an added value of CPS above and beyond 
reasoning abilities has been found in predictions of academic 
achievement (e.g., Greiff et al., 2013; Greiff, Wüstenberg, & 
Funke, 2012; Wüstenberg, Greiff, & Funke, 2012) and supervisor 
ratings of professional success (Danner, Hagemann, Schankin, 
Hager, & Funke, 2011). In general, it is assumed that the under- 
lying cognitive processes of CPS are correlated with and yet 
distinct from other general cognitive abilities (cf. Schweizer, 
Wüstenberg, & Greiff, 2013; Wüstenberg et al., 2012) and that in 
addition, more complex cognitive processes related to knowledge 
acquisition and knowledge application are responsible for the 
added value of CPS in predicting relevant external criteria (cf. 
Gonzalez, Thomas, & Vanyukov, 2005; Greiff, Wüstenberg, et al., 
2013; Wenke, Frensch, & Funke, 2005). However, a final embed- 
ding of CPS in the nomological network of theories on cognitive 
ability (e.g., in the Cattell-Horn-Carroll [CHC] theory; McGrew, 
2009) has not yet been fully accomplished (cf. Greiff, Wüstenberg, 
et al., 2013). 


ICT Literacy and How It Might Be Related to CPS 


In recent studies that have demonstrated the incremental validity 
of CPS beyond other cognitive abilities, it has been argued that 
there are unique characteristics and complex cognitive processes 
inherent in CPS and that these are not found in the conceptualiza- 
tions of other general cognitive abilities (Raven, 2000). However, 
due to the fact that CPS assessment instruments require computer- 
based test administration, researchers cannot rule out the possibil- 
ity that the added value of CPS may stem from an influence of 
computer literacy on CPS test results. In this line of thinking, CPS 
tests would then provide an indirect measure of information and 
communication technology (ICT) literacy. 

Due to the enhanced complexity and attractiveness of CBAs, it 
has been assumed that ICT literacy might have a strong impact on 
performance in CBAs, especially if these assessments require more 
complex interactions with the computer, a requirement that holds 
in particular for CPS. In a comprehensive definition, Tsai (2002) 
described ICT literacy as “the basic knowledge, skills, and atti- 
tudes needed by all citizens to be able to deal with computer 
technology in their daily life” (p. 69). Thus, declarative, proce- 
dural, and attitudinal aspects are covered by this conceptualization, 
which indicates that it is not only computer knowledge and skills 
that are important for handling CBAs. Affective components can 
influence performance as well. For instance, high computer anxi- 
ety may lead to discomfort when using the computer, resulting in 
lower performance on exploratory behavior. Thus, the added value 
of CPS above and beyond general cognitive ability could be due to 
tests of CPS inadvertently providing an indirect assessment of ICT 
literacy rather than to CPS representing additional complex cog- 
nitive processes required by the problem-solving situation. As a 
consequence, the overall validity of the CPS construct as a com- 
plex cognitive skill that is distinct from other general cognitive 
abilities might be threatened because empirical access to this 
construct is inevitably bound to the computer-based administration 
mode (cf. Parshall, Spray, Kalohn, & Davey, 2002; Russell, Gold- 
berg, & O'Connor, 2003). 

This concern is an important one especially against the back- 
ground of the general shift from paper-pencil tests toward CBA 
(Goldhammer, Naumann, & Keßel, 2013). For example, early 
large-scale assessments (e.g., the PISA survey in 2003) used only 
paper-and-pencil assessments. Computer-based testing was partly 
introduced in PISA 2006 (OECD, 2007), substantially extended in 
PISA 2012 (OECD, 2010), included in the Programme for the 
International Assessment of Adult Competencies (OECD, 2009b), 
and, finally, will constitute the major mode of delivery in PISA 
2015 (OECD, 2012). This progression can be accounted for by 
several general advantages of CBA (cf. Scheuermann & Björns- 
son, 2009; Van der Linden & Glas, 2000) such as high standard- 
ization and test efficiency, the logging of behavioral and process 
data, the possibility of automatic scoring, and the application of 
adaptive testing. 

As any computer-administered test requires the test taker to 
interact with the computer, the influence of ICT literacy can be 
considered a general threat to the validity of any construct assessed 
via this mode of administration; thus, the issue is not limited to 
CPS. Studies concerning mode-of-delivery effects in general have 
addressed this topic but have provided inconsistent results. In an 
older meta-analysis, Mead and Drasgow (1993) found no overall 
difference between paper-pencil and CBAs. Nevertheless, the au- 
thors warned against drawing the conclusion that there is no test 
mode effect and thus no influence of ICT literacy at all. Therefore, 
it is not surprising that several studies have reported relevant 
test-mode effects (for an overview, see Clariana & Wallace, 2002; 
Russell et al., 2003). These inconsistent results might also be due 
to the fact that different types of tests imply more or less complex 
interactions with the computer and thus require different levels of 
ICT literacy. 

In this sense, constructs such as CPS, which are based on more 
innovative item types that reflect the dynamic interaction and 
display features that are offered by the computer, are also more 
prone to the undesirable influence of ICT literacy. At the same 
time, for tests that rely on these new item types such as CPS, it is 
not possible to conduct studies on test-mode administration effects 
because the classical paper-and-pencil mode of administration is 
simply not available for these item types and thus cannot be 
compared with the computer-administered mode. For other gen- 
eral cognitive abilities such as reasoning and working memory, 
empirical studies can be conducted to examine how assessment 
may be affected by changing the mode of delivery from paper- 
and-pencil to computer based. The question then arises: How can 
researchers understand and quantify the influence of ICT literacy 
on an assessment of complex cognitive skills such as CPS when 
the assessment can be administered only on the computer (Kyl- 
lonen, 2009)? That is, for CPS, it is yet unclear whether its added 
value in predicting external criteria (e.g., Schweizer et al., 2013; 
Wüstenberg et al., 2012) originates from the indirect assessment of 
ICT literacy or from the assessment of additional and relevant 
cognitive processes as mentioned above and as conceptually as- 
sumed. 


The Added Value of CPS: Cognitive Processes or 
Merely ICT Literacy? 


The central question of the current study is about the influence 
of ICT literacy on the CBA of CPS. Specifically, we asked how we 
could explain the added value in terms of the incremental validity 
of CPS in predicting relevant external criteria above and beyond 
general cognitive ability. We proposed two conspicuous explana- 
tions: additional cognitive processes, on the one hand, and addi- 
tional demands on students’ ICT literacy, on the other hand. 
Establishing the construct of CPS as a transversal skill would be 
warranted and an assessment of CPS in international large-scale 
studies would be justified only if the first explanation were to hold. 

To tackle this question, we had to examine the simultaneous 
influence of general cognitive ability and ICT literacy on CPS. In 
general, cognitive abilities and CPS share cognitive processes to a 
certain extent (cf. Greiff, Wüstenberg, et al., 2013; Wüstenberg et 
al., 2012). However, according to the definition and characteristics 
of CPS, unique processes are supposed to be inherent to CPS. 
Different from general cognitive ability, which mainly requires a 
mere sequence of simple cognitive processes, CPS requires a series 
of different cognitive processes such as action planning and im- 
plementing, strategic development, knowledge acquisition, and 
self-regulation (Funke, 2010; Raven, 2000). 

As outlined above, ICT literacy might also have an influence on 
CPS as a consequence of the mode of delivery. Each CBA requires 
at least basic computer knowledge as well as the related perceptual 
and motor skills that are needed to use the computer interface. 
Furthermore, by definition, higher ICT literacy leads to a more 
familiar and intuitive handling of the computer interface, or to look 
at it the other way around, if a student’s ICT literacy in an 
assessment context is very low, cognitive resources have to be 
used to understand the computer interface, for example. This 
would tie up a large amount of cognitive capacity that would then 
not be available for CPS even if the core interest of the assessment 
lies in CPS (Goldhammer et al., 2013; see also cognitive load 
theory; Sweller, 2005). In conclusion, to determine whether CPS is 
more than general cognitive ability and ICT literacy combined, the 
relations of both of them to CPS must be examined simultane- 
ously. 

Surprisingly, there are hardly any empirical findings with regard 
to the impact of ICT literacy on CPS and none at all with regard 
to the relations between ICT literacy, general cognitive ability, and 
CPS. Hartig and Klieme (2005) reported small relations between 
CPS and self-reported ICT literacy. Further, Süß (1996) reported 
moderate to high correlations between objective indicators of ICT 
literacy and CPS. However, these early studies did not account for 
the development of innovative, more user-friendly computer inter- 
faces or the substantial changes in the use and importance of 
computers in everyday life; these changes have thus resulted in 
study participants who can be considered “digital natives” (Pren- 
sky, 2001). In a recent study, Sonnleitner, Keller, Martin, and 
Brunner (2013) highlighted that an added value of CPS beyond 
reasoning is found only in academic achievement criteria that are 
assessed via computer but not in paper-pencil-assessed criteria. 
They concluded that the added value of CPS is merely an effect of 
test mode and thus of ICT literacy. 

Overall, there are two possible explanations for the added value 
of CPS recently reported in the literature: complex cognitive 
processes that are not included in concepts of general cognitive 
ability or an indirect but substantial influence of ICT literacy on 
the CBA of CPS. However, there are very few studies that have 
targeted this issue, and these studies have produced inconsistent 
findings. The purpose of the current study was to take a deeper 
look into the impact of ICT literacy on CPS and on CBAs of 
transversal skills in general. 


Purpose of the Study 


Generally speaking, we want to advance knowledge on the 
question of how CPS and ICT literacy are related to each other and 
whether CPS indeed yields a valuable marker of additional com- 
plex cognitive processes or whether it is a confounded indicator of 
general cognitive ability and ICT literacy. Thus, we addressed the 
question of whether the added value of CPS reported in some 
studies could be explained by ICT literacy or, in other words, 
whether CPS is something other than general cognitive abilities 
such as reasoning or working memory and ICT literacy combined. 
To this end, we derived three research questions. 

Research Question 1: How are ICT literacy and CPS related to 
each other? 

For the first question, we examined latent correlations between 
ICT literacy and the CPS dimensions of knowledge acquisition and 
knowledge application. 

Research Question 2: Does ICT literacy more strongly predict a 
CBA of CPS than ICT literacy predicts the assessment of general 
cognitive ability? 

In the next step, we analyzed the latent regression of CPS and 
general cognitive ability on ICT literacy and tested whether ICT 
literacy would predict CPS more strongly than it predicted general 
cognitive ability. 

Research Question 3: Can the added value of CPS be explained 
by ICT literacy or are distinct cognitive processes in CPS respon- 
sible for the added value? 

For the last question, we examined whether CPS could explain 
academic achievement above and beyond general cognitive ability 
and ICT literacy combined or whether controlling for general 
cognitive ability and ICT literacy would result in the nonsignifi- 
cant prediction of external criteria such as academic achievement 
by CPS. 
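
As an illustration of the logic behind Research Question 3, incremental validity can be approximated by comparing the variance explained with and without CPS. The following is a minimal sketch under stated assumptions, not the analysis reported here: the studies used latent variables in structural equation models, whereas this example uses ordinary least squares on observed scores. The function names (r_squared, incremental_r2), the variable names, and the simulated data are hypothetical.

```python
import numpy as np


def r_squared(X, y):
    """Proportion of variance in y explained by an OLS fit on X (intercept added)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return 1.0 - residuals.var() / y.var()


def incremental_r2(controls, predictor, y):
    """Gain in R^2 when `predictor` is added to a model that already contains `controls`."""
    baseline = r_squared(controls, y)
    full = r_squared(np.column_stack([controls, predictor]), y)
    return full - baseline


# Simulated (hypothetical) scores: reasoning, ICT literacy, CPS, and GPA.
rng = np.random.default_rng(0)
n = 500
g = rng.normal(size=n)
ict = 0.3 * g + rng.normal(size=n)
cps = 0.5 * g + 0.2 * ict + rng.normal(size=n)
gpa = 0.4 * g + 0.3 * cps + rng.normal(size=n)

# Variance in GPA explained by CPS over and above reasoning and ICT literacy.
delta = incremental_r2(np.column_stack([g, ict]), cps, gpa)
print(f"Incremental R^2 of CPS: {delta:.3f}")
```

If the gain in explained variance shrank to about zero once ICT literacy was among the controls, the added value of CPS would be attributable to ICT literacy; a gain that remains would point to distinct processes.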

To answer these questions, we used three different and diverse 
samples containing both high school and university students. In all 
these samples, the added value of CPS beyond general cognitive 
ability has been shown to exist in previously published articles 
(Study A: Wüstenberg et al., 2012; Study B: Greiff, Fischer, et al., 
2013; Study C: Schweizer et al., 2013). However, the added value 
of CPS beyond ICT literacy and general cognitive ability was not 
tested in any of these studies. Thus, we extended the original 
analyses by adding ICT literacy, which was operationalized in 
diverse ways. According to its definition (see above; Tsai, 2002), 
ICT literacy is a broad concept composed of cognitive and affec- 
tive aspects. Thus, a valid assessment of ICT literacy needs to 
endorse different operationalizations and methods (Ballantine, Mc- 
Court Larres, & Oyelere, 2007; Goldhammer et al., 2013; Van 
Braak, 2004). Therefore, we used subjective self-reports and dif- 
ferent objective performance tests. Consequently, we used differ- 
ent operationalizations of general cognitive ability as well: figural 
reasoning and working memory capacity. Finally, an acknowl- 
edged and well-validated measure of CPS, namely MicroDYN 
(Greiff et al., 2012), was used in all three samples. To sum up, this 
approach allowed us to examine our research questions in different 
samples with heterogeneous assessments of ICT literacy and gen- 
eral cognitive ability. Our findings can thus be cross-checked to 
ensure that they are replicable and generalizable (Brennan, 1983). 


Method 


Assessment Instrument for CPS: MicroDYN 


In all three studies (Study A, Study B, and Study C), a set of 
tasks used in the MicroDYN approach (Greiff et al., 2012) was 
used to assess CPS. In MicroDYN, students are first asked to 
detect causal relations in a dynamic system composed of several 
input and output variables. Subsequently, they are asked to control 
the system. These two tasks directly relate to the two characteristic 
CPS dimensions introduced above, knowledge acquisition and 
knowledge application, thus ensuring the theoretical embedding of 
the MicroDYN approach. 

Recent results have indicated that MicroDYN is a reliable 
(consistent Cronbach’s αs > .70; cf. Greiff et al., 2012; Wüsten- 
berg et al., 2012) and valid assessment instrument (Greiff, Wüsten- 
berg, et al., 2013; Molnár, Greiff, & Csapó, 2013; Schweizer et al., 
2013; Wüstenberg et al., 2012) that sufficiently reflects the theo- 
retical concept of CPS. For instance, MicroDYN as an operation- 
alization of CPS shows incremental validity in predicting aca- 
demic achievement beyond general cognitive abilities such as 
reasoning and working memory (Greiff, Wüstenberg, et al., 2013; 
Schweizer et al., 2013; Wüstenberg et al., 2012). Further, a sub- 
stantial number of the items used to assess CPS in 15-year-old 
students across a number of countries in the PISA 2012 study were 
developed within the MicroDYN approach. 
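
For readers unfamiliar with the reliability coefficient cited above, the standard formula for Cronbach’s α can be sketched as follows. This is an illustrative implementation only; the function name cronbach_alpha and the small data matrix are hypothetical and are not taken from any of the studies.

```python
import numpy as np


def cronbach_alpha(scores):
    """Cronbach's alpha for a persons-by-items score matrix.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1.0 - item_variances / total_variance)


# Hypothetical dichotomous scores (rows: students, columns: tasks).
example = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
print(round(cronbach_alpha(example), 2))
```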

A set of MicroDYN tasks typically encompasses five to 10 
independent complex problems (also referred to as microworlds in 
the literature; cf. Funke, 2001), with time on task being approxi- 
mately 5 min for each CPS task. Each task has an underlying 
causal structure unknown to the student and is divided into two 
subsequent phases: Phase 1, in which knowledge acquisition is 
assessed, and Phase 2, in which knowledge application is assessed. 
As an illustration, consider the MicroDYN task handball training 
displayed in Figure 1. There, input variables (i.e., different training 
strategies labeled Strategy A, Strategy B, Strategy C) influence 
several output variables (i.e., characteristics of the team labeled 
Motivation, Power of the throw, Exhaustion). In Phase 1, students 
can freely explore the task (duration: 3 min) by manipulating the 
sliders on the left and by observing subsequent changes in the 
output variables on the right (cf. Figure 1). During this free 
exploration, students are asked to specify the relations between 
variables on a concept map displayed at the bottom of Figure 1 by 
drawing arrows between input and output variables (e.g., between 
Strategy A and Motivation), thereby capturing their mental repre- 
sentation of the underlying system structure. In Phase 2, students 
are instructed to reach given goal values on the output variables 
(e.g., increasing the Power of the throw to five) by manipulating 
the input variables in the correct way (e.g., increasing Strategy B; 
duration: 1.5 min). Each MicroDYN task is embedded in a differ- 
ent cover story, and inputs as well as outputs are labeled without 
deep semantic meaning to increase motivation and minimize the 
influence of prior knowledge. 
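
The two phases of a task of this kind can be made concrete with a small simulation. The sketch below rests on assumptions that go beyond the description above: it treats the hidden causal structure as a linear system in which each input adds a fixed amount to the outputs per round, the effect matrix is invented for illustration (it is not the structure of the handball training task), and the exploration routine uses one possible systematic strategy, varying one input at a time. The names EFFECTS, step, and explore are hypothetical.

```python
import numpy as np

# Hypothetical effects of inputs (columns: Strategy A, B, C) on
# outputs (rows: Motivation, Power of the throw, Exhaustion).
EFFECTS = np.array([
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 3.0],
])


def step(outputs, inputs, effects=EFFECTS):
    """Advance the system by one round; slider values are limited to -2..+2."""
    inputs = np.clip(np.asarray(inputs, dtype=float), -2.0, 2.0)
    return np.asarray(outputs, dtype=float) + effects @ inputs


def explore(effects=EFFECTS):
    """Phase 1 sketch: vary one input at a time and record which outputs change."""
    n_outputs, n_inputs = effects.shape
    start = np.zeros(n_outputs)
    relations = set()
    for j in range(n_inputs):
        inputs = np.zeros(n_inputs)
        inputs[j] = 1.0
        change = step(start, inputs, effects) - start
        for i in np.flatnonzero(change):
            relations.add((j, int(i)))  # (input index, output index)
    return relations


# Phase 2 sketch: apply an input setting and inspect the resulting outputs.
print(sorted(explore()))
print(step([10.0, 3.0, 5.0], [0, 1, 0]))
```

The set returned by explore corresponds to the arrows a student would draw on the concept map, and repeated calls to step correspond to the rounds in which goal values are pursued.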

Depending on the specific number of tasks, a CPS assessment 
with MicroDYN takes between 40 and 60 min including instruc- 
tions. Detailed information on the rationale underlying these types 
of tasks can be found in Funke (2001), and the MicroDYN ap- 
proach is described in detail in Greiff et al. (2012) and Schweizer 
euralen (ZO) 

In all three samples, a set of MicroDYN tasks was used to 
capture knowledge acquisition and knowledge application as core 
dimensions of CPS (eight tasks in Study A, 10 tasks in Study B, 
and seven tasks in Study C). However, with regard to differences 
in cognitive potential across the three samples (e.g., high-ability 
university students with above-average cognitive performance in 
Study A and average-ability high school students in Grade 8 in 
Study C), MicroDYN task difficulty was adjusted accordingly. To 
increase difficulty for the more able samples, the underlying sys- 
tem structures of the MicroDYN tasks were designed to be more 
complex by increasing the number of inputs and outputs, by 
increasing the number of relations between them, and by introduc- 
ing outputs that changed by themselves over time without active 
manipulation of the inputs (for details on altering difficulty in 
MicroDYN, cf. Greiff et al., 2012). 

Figure 1. Screenshot of the MicroDYN task handball training. The controllers of the input variables range 
from “- -” (value = −2) to “++” (value = +2). The current values and the target values of the output variables 
are displayed numerically (e.g., current value for Motivation: 21; target values: 16-18) and graphically (current 
value: dots; target value: red line). The correct model is shown at the bottom of the figure (cf. Wüstenberg et 
al., 2012). 

With regard to the scoring of MicroDYN, full credit for knowl- 
edge acquisition was given if students’ models contained no mis- 
takes. If additional relations were reported or actual relations were 
omitted, zero credit was assigned. A full score in knowledge 
application was given if goal values were reached, whereas no 
credit was given if target values were not reached (for details on 
scoring, cf. Greiff et al., 2012; Kröner, Plass, & Leutner, 2005). 
Thus, each MicroDYN task yielded indicators on knowledge ac- 
quisition and knowledge application totaling eight, 10, and seven 
indicators in Studies A, B, and C, respectively, for each of the two 
CPS dimensions. 
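
The dichotomous scoring rule described above can be sketched in a few lines. This is an illustration under assumptions: relations are represented as (input, output) pairs, target values are represented as inclusive ranges (in line with the target range shown in Figure 1), and the function names are hypothetical.

```python
def score_knowledge_acquisition(reported_relations, true_relations):
    """Full credit (1) only if the reported model matches the true model exactly.

    Any additional or omitted relation yields zero credit.
    """
    return 1 if set(reported_relations) == set(true_relations) else 0


def score_knowledge_application(final_values, target_ranges):
    """Full credit (1) only if every output ends within its target range."""
    reached = all(low <= value <= high
                  for value, (low, high) in zip(final_values, target_ranges))
    return 1 if reached else 0


# Hypothetical task with two true relations and two target ranges.
true_model = {("Strategy A", "Motivation"), ("Strategy B", "Power of the throw")}
student_model = {("Strategy A", "Motivation")}  # one relation omitted

print(score_knowledge_acquisition(student_model, true_model))    # 0
print(score_knowledge_application([17, 5], [(16, 18), (5, 5)]))  # 1
```

Applying both functions to every task yields one indicator per task for each of the two CPS dimensions.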


Study A: Relations Among CPS Components, 
Computer Knowledge, Computer Anxiety, Figural 
Reasoning, and Final Grade-Point Average 


Participants. The final sample consisted of N = 222 high- 
ability university students (69% female; age: M = 22.8; SD = 4.0) 
majoring mainly in psychology. In psychology, admission depends 
on final school grade-point average (GPA), and the selection 
process is highly competitive. As a consequence, psychology 
students at German universities usually have above-average cog- 
nitive performance. Students received partial course credit for 
participation and an additional obol (€5 [about $6 U.S.]) for 
working conscientiously. Missing data that occurred due to soft- 
ware problems or a failure of participants to work conscientiously 
led to n = 16 exclusions from the initial sample. The study took 
place in the Department of Psychology at the University of Heidel- 
berg, Germany. 

Materials. 

CPS. MicroDYN with eight different tasks was used for the 
CPS assessment. 

ICT literacy. ICT literacy was assessed using two subtests 
from the German inventory for the assessment of computer liter- 
acy, computer-related attitudes, and computer anxiety (Revised 
Computer Literacy Inventory, INCOBI-R; Richter, Naumann, & 
Horz, 2010). Both tests were administered on computers. The first 
subscale, Practical Computer Knowledge (PRACOWI), contains 
20 written scenarios of commonly occurring computer problems. 
For each scenario, one of four presented solutions is correct. The 
subscale is substantially correlated with measures of computer use 
and predicts the ability to master everyday computer tasks (Appel, 
2012; Naumann, Richter, & Groeben, 2001; Richter, Naumann, & 
Groeben, 2001; Richter et al., 2010). It distinguishes between 
computer experts and novices (Naumann et al., 2001) and is best 
described by a one-dimensional model (Richter et al., 2010). The 
scale shows good internal consistency (Cronbach’s α = .83), and 
the items are one-dimensional according to the Rasch model. Thus, 
PRACOWI is a reliable and valid (e.g., r = .60 with basic 
computer skills) measure of the ability to deal successfully with 
everyday computer tasks and problems. It represents the declara- 
tive knowledge scope of ICT literacy described by Tsai (2002). 
The second subscale, Computer Anxiety (COMA), captures wor- 
ries about the personal use of computers and computer-related 
anxiety. Computer anxiety is seen as a trait that includes cognitive 
and affective components (Morris, Davis, & Hutchings, 1981; 
Richter et al., 2010). The items refer to feelings of anxiety (e.g., 
“Working with the computer makes me uneasy”) as well as to 
cognitions of concern (e.g., “When working with the computer, I 
am often afraid of breaking something”). The subscale covers the 
scope of attitudes toward ICT literacy (cf. Van Braak, 2004). 
Discriminant and criterion validity (r = −.33 with duration of 
computer experience) and good reliability (Cronbach’s α = .82) of 
the COMA have been shown in several samples (e.g., Appel, 2012; 
Richter et al., 2010). The subscale consists of eight self-report 
items that are rated on a 5-point Likert scale (from −2 = do not 
agree to 2 = agree), with higher values indicating higher anxiety. 

General cognitive ability. Figural reasoning as a general cog- 
nitive ability was assessed using a computer-adapted version of the 
Advanced Progressive Matrices (APM; Raven, 1958). This test is 
viewed as a valid indicator of fluid intelligence (Raven, Raven, & 
Court, 1998) and shows good internal consistency (Cronbach’s 
α = .85). Each item was scored dichotomously. 

Academic achievement. Academic achievement was mea- 
sured as students’ self-reported final school GPA at the end of 
schooling. As usual in German schools, school marks ranged from 
1 (excellent) to 6 (poor). For further analyses, we reversed the 
school marks so that higher numerical values reflected better 
performance. 
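
A reversal of this kind can be sketched as follows. The exact transformation is not specified above, so the mapping used here (lowest mark plus highest mark minus the observed mark) is an assumption, and the function name reverse_mark is hypothetical.

```python
def reverse_mark(mark, best=1.0, worst=6.0):
    """Reverse a German school mark so that higher values mean better performance.

    Assumed mapping: 1 (excellent) -> 6, 6 (poor) -> 1.
    """
    if not best <= mark <= worst:
        raise ValueError(f"mark must lie between {best} and {worst}")
    return best + worst - mark


print([reverse_mark(m) for m in (1.0, 2.5, 6.0)])  # [6.0, 4.5, 1.0]
```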

Procedure. Testing was split into two sessions, each lasting 
approximately 50 min. In the first session, students worked on 
MicroDYN. In the second session, the APM, PRACOWI, and 
COMA were administered. Afterwards, students provided demo- 
graphic data as well as school marks. 


Study B: Relations Among CPS Components, Basic 
Computer Skills, Computer Anxiety, Figural 
Reasoning, and Final School Marks 


Participants. The sample consisted of 341 university students 
(67% female; age: M = 22.3; SD = 4.0) with a broad study 
background who were majoring mainly in social sciences. Students 
received either partial course credit or a financial reimbursement 
of €20 (about $25 U.S.) for their participation. The study took 
place in the Department of Psychology at the University of Heidel- 
berg, Germany. 

Materials. 

CPS. MicroDYN with 10 different tasks was used for CPS 
assessment. 

ICT literacy. ICT literacy was assessed with two different 
instruments. First, a further developed version of the Basic Com- 
puter Skills Test (BCS; Goldhammer et al., 2013) was used; it is 
considered a computer-based, objective, and performance-based 
measure of basic ICT skills in line with Tsai (2002). The 20 tasks 
require students to access, collect, and provide information in 
simulated graphical user interfaces of several computer environ- 
ments (e.g., web browser, text editor). The environments, although 
only an abstract representation of real software, share general 
characteristics of real computer environments (for more details and 
task descriptions, see Goldhammer et al., 2013). Further, Gold- 
hammer et al. (2013) reported substantial correlations with other 
measures of computer skills (e.g., r = .60 with PRACOWI), 
discriminant validity (e.g., r = .32 with word recognition), unidi- 
mensionality, and good reliability (Cronbach’s α = .70). There- 
fore, the BCS can be considered to be a valid measure of ICT 
literacy. For each task, the correct user response (BCS ability 
according to Goldhammer et al., 2013) was given full credit; 
otherwise, no credit was given. As a second measure, the COMA 
of the INCOBI-R (Richter et al., 2010) as in Study A (see above) 
was used. 

General cognitive ability. Figural reasoning as a general cog- 
nitive ability was assessed using a computer-adapted version of the 
matrices subtest of the Intelligence Structure Test-Revised (Beau- 
ducel, Liepmann, Horn, & Brocke, 2010). This test is viewed as a 
good indicator of reasoning (cf. Carroll, 1993) but consists of more 
diverse task contents than the APM test used in Study A. Accord- 
ing to the test manual, the matrices subtest showed an acceptable 
reliability (Cronbach’s α = .71) and validity (Beauducel et al., 
2010). Each item of the subtest was scored dichotomously. 

Academic achievement. Academic achievement was reported 
as final school marks when leaving high school in four natural 
science subjects (math, physics, chemistry, and biology) and five 
subjects that consisted of either social sciences or languages (Ger- 
man, English, history, geography, and social studies). School 
marks were reversed for all analyses so that higher numerical 
values reflected better performance. 

Procedure. Testing was divided into two sessions of 2.5 and 
2 hr. In the first session, students worked on MicroDYN and 
provided demographic data as well as school marks. In the second 
session, the IST, BCS, and COMA were administered. Additional 
measures that were not relevant for this article were administered 
in both the first and second sessions. 


Study C: Relations Among CPS Components, 
Computer Anxiety, Working Memory Capacity, and 
Annual School Marks 


Participants. The sample consisted of 389 high school stu- 
dents (60% female; age: M = 17.1; SD = 1.1). Students were 
offered individual feedback on their results in return for their 
participation. From the initial sample, n = 16 students were 
excluded from the analyses because of software errors. The study 
took place at two German high schools, both located in southwest- 
ern Germany. 

Materials. 

CPS. MicroDYN with seven different tasks was used for the 
CPS assessment. 

ICT literacy. ICT literacy was assessed using the COMA from 
the INCOBI-R (Richter et al., 2010) as in Study A. 

General cognitive ability. Numerical and spatial working 
memory capacity were assessed as measures of general cognitive 
ability using a computer version of the memory updating numer- 
ical task (MUN; Oberauer, Süß, Schulze, Wilhelm, & Wittmann, 
2000; Sander, 2005). The aim of the MUN task is to remember the 
values and the locations of several numbers displayed on the 
screen, to mentally modify the values according to the task 
(“updating”), and to return the modified numbers in the proper places 
(for a detailed description, see Schweizer et al., 2013). Thus, the 
MUN requires the storage and transformation of information as 
well as its coordination. The MUN task is used as a marker task for 
working memory (cf. Oberauer, Süß, Wilhelm, & Wittmann, 2003) 
because of its good reliability (Cronbach’s α = .81) and validity 
(e.g., substantial factor loadings on the working memory factor 
simultaneous storage and transformation and the factor coordina- 
tion). For each item, the percentage of correctly reproduced num- 
bers was used as the performance indicator. 

Academic achievement. Academic achievement was reflected 
in school marks from the latest annual school certificate in four 
natural science subjects (math, physics, chemistry, and biology) 
and three social science subjects (history, geography, and social 
studies). School marks were reversed for all analyses so that higher 
numerical values reflected better performance. 

Procedure. Testing consisted of one session of 1.5 hr. Stu- 
dents worked on MicroDYN, the MUN, and COMA and provided 
demographic data as well as school marks. Additional measures 
that were not relevant for this article were administered subse- 
quently. 


Summary of Measures 


Overall, two dimensions of CPS, knowledge acquisition and 
knowledge application, were assessed on the computer by Micro- 
DYN (in all three studies); ICT literacy was assessed on the 
computer by measuring practical computer knowledge (PRACOWI; 
Study A), basic computer skills (BCS; Study B), and computer 
anxiety (COMA; Studies A, B, and C); general cognitive ability 
was assessed on the computer by measuring figural reasoning (the 
APM in Study A, a subtest of the IST in Study B) and working 
memory capacity (Study C); and academic achievement was mea- 
sured by overall GPA (Study A), final school marks when leaving 
high school (Study B), and school marks from the students’ latest 
annual school certificate (Study C). 


Statistical Methods 


Data were analyzed using confirmatory factor analysis (CFA) 
and structural equation modeling (SEM; cf. Bollen, 1989); that is, 
all reported results and coefficients were measured on a latent level 
without measurement error. All models were estimated with the 
software Mplus 7.0 (Muthén & Muthén, 2012). To evaluate model 
fit for SEM, we applied standard fit indices such as the compar- 
ative fit index (CFI), Tucker-Lewis Index (TLI), root-mean-square 
error of approximation (RMSEA), standardized root-mean-square 
residual (SRMR), and weighted root-mean-square residual 
(WRMR) by endorsing the cutoff values recommended by Hu and 
Bentler (1999). Unless noted otherwise, we applied standard max- 
imum likelihood estimation and used the full information maxi- 
mum likelihood estimation method to adjust for missing data in 
order to ensure high statistical power for the detection of small 
effects. 

In all three studies, we used the following statistical analyses for 
each research question: 

To tackle the first research question, we analyzed the relation 
between CPS and ICT literacy by computing latent correlations 
(i.e., correlations adjusted for measurement error). 


To tackle the second research question, we analyzed latent 
relations between CPS and general cognitive ability as criteria 
and ICT literacy as a predictor in a first model. In this model, 
the path coefficients from ICT literacy to CPS and general 
cognitive ability indicated the corresponding impact of ICT 
literacy. However, even if the path coefficient from ICT literacy 
to CPS were stronger than the path coefficient from ICT literacy 
to general cognitive ability, it would not be clear whether ICT 
literacy had a greater influence on CPS or whether the variation 
in path coefficients occurred merely by chance. To test this 
question statistically, we modified the first model to derive a 
second model. In this second model, the path coefficients from 
ICT literacy to CPS and general cognitive ability were con- 
strained to equality to simulate an equal impact of ICT literacy 
on CPS and general cognitive ability. The subsequent change in 
model fit between the first and the second models provided the 
answer about whether the difference in impact of ICT literacy 
was statistically meaningful. If the chi-square difference be- 
tween the unconstrained (i.e., first) and constrained (i.e., sec- 
ond) model turned out to be significant, this would indicate that 
constraining the parameters to equality significantly worsened 
the model fit, and, therefore, we would have to assume an 
unequal impact of ICT literacy on CPS and general cognitive 
ability. If several measures of ICT literacy were used in one 
study (i.e., in Studies A and B), constraints were imposed 
separately for each measure in order to be able to quantify the 
impact of the specific ICT literacy measure on CPS and general 
cognitive ability. We used chi-square difference tests with the 
mean- and variance-adjusted maximum likelihood estimator (cf. 
Muthén & Muthén, 2012) for these analyses. 
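
To illustrate the decision rule, the following sketch converts a chi-square difference statistic into a p value. It assumes the statistic has already been adjusted by the estimation software (with mean- and variance-adjusted estimators, the difference test is not a simple subtraction of the two model chi-square values), and it relies on the closed-form tail probabilities that exist for 1 and 2 degrees of freedom. The function name `chi2_upper_tail` is hypothetical; the example values are the difference statistics reported in the Results section.

```python
import math


def chi2_upper_tail(statistic, df):
    """Return P(X >= statistic) for a chi-square variable with df = 1 or 2."""
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    if df == 1:
        # A chi-square variable with 1 df is the square of a standard normal.
        return math.erfc(math.sqrt(statistic / 2))
    if df == 2:
        # With 2 df the chi-square distribution is exponential with mean 2.
        return math.exp(-statistic / 2)
    raise ValueError("this sketch only covers df = 1 and df = 2")


# Difference statistics as reported for the constrained models.
reported = {
    "A.2 (PRACOWI paths equal)": (1.338, 2),
    "A.3 (COMA paths equal)": (0.670, 2),
    "B.2 (BCS paths equal)": (1.616, 2),
    "B.3 (COMA paths equal)": (4.177, 2),
    "C.2 (COMA paths equal)": (9.027, 2),
}

for label, (statistic, df) in reported.items():
    p = chi2_upper_tail(statistic, df)
    verdict = "unequal impact" if p < 0.05 else "no evidence of unequal impact"
    print(f"Model {label}: p = {p:.3f} -> {verdict}")
```

A significant result means that forcing the paths to be equal worsens the fit, which is the case only for the constrained model in Study C.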

To tackle the third research question, we analyzed the incre- 
mental validity of CPS with regard to academic achievement as 
the criterion in all three studies. In other words, after control- 
ling for general cognitive ability and ICT literacy, we entered 
CPS as a third predictor in the equation. In detail, we checked 
the predictive validity of ICT literacy and general cognitive 
ability in a single model. In a second step, we further added 
CPS and thus entered all constructs into one model. In this latter 
model, CPS was regressed on general cognitive ability and ICT 
literacy first. The CPS residuals of this regression as well as 
general cognitive ability and ICT literacy were then used to 
predict academic achievement. If the path coefficients of the 
CPS residuals ended up being significant, this would indicate 
that CPS explained additional variance above and beyond gen- 
eral cognitive ability and ICT literacy. That is, the added value 
of CPS would then not be attributable to an indirect assessment 
of ICT literacy within CPS measures (for more details on this 
specific regression procedure, see Wüstenberg et al., 2012). 
When several indicators of academic achievement were avail- 
able (Studies B and C), we used CFA to calculate latent grade 
factors (one factor for natural sciences and one factor for social 
sciences and languages) instead of using a manifest grade 
marker of academic achievement (Study A). Furthermore, if the 
indicators for academic achievement were ordered categorical 
variables, we used the weighted least squares mean- and 
variance-adjusted estimator (Muthén & Muthén, 2012) for the 
statistical analysis of the last research question. 


Results 


The purpose of this article was to examine the influence of ICT 
literacy on CPS. Therefore, instead of turning our attention to 
verifications of different measurement models, we referred to 
already existing research to derive the measurement models. That 
is, the structure and dimensionality as well as the model fit of the 
measurement models were described in the corresponding articles: 
for CPS and general cognitive ability, Wüstenberg et al. (2012; 
Study A), Greiff, Wüstenberg, et al. (2013; Study B), and 
Schweizer et al. (2013; Study C); for basic computer skills, 
Goldhammer et al. (2013; Study B); and for practical computer 
knowledge and computer anxiety, Richter et al. (2010; Study A 
and all three studies, respectively). On this basis, we created 
parcels (according to the item-to-construct balance recommended 
by Little, Cunningham, Shahar, & Widaman, 2002) for each mea- 
surement (i.e., measures of CPS, ICT literacy, and general cogni- 
tive ability) in order to better capture the latent constructs and to 
increase the accuracy of parameter estimations. The model fit for 
all parceled measurement models in our study was at least accept- 
able (i.e., CFI and TLI > .95; RMSEA < .06; SRMR < .05 or 
WRMR < .90). Comprehensive correlation tables are available in 
the supplementary material. 
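
The cutoff rule in parentheses can be written as a small check. This is only a sketch of the criteria as stated in the text; the function name `fit_is_acceptable` is hypothetical, and the example values are those listed for Model A.1 in Table 1.

```python
def fit_is_acceptable(cfi, tli, rmsea, srmr=None, wrmr=None):
    """Apply the cutoffs stated in the text to one set of fit indices.

    CFI and TLI must exceed .95 and RMSEA must be below .06. In addition,
    either SRMR must be below .05 or WRMR must be below .90, depending on
    which residual-based index the estimator provides.
    """
    if srmr is None and wrmr is None:
        raise ValueError("provide SRMR or WRMR")
    residual_ok = (srmr is not None and srmr < 0.05) or (
        wrmr is not None and wrmr < 0.90
    )
    return cfi > 0.95 and tli > 0.95 and rmsea < 0.06 and residual_ok


# Model A.1 from Table 1: CFI = .976, TLI = .969, RMSEA = .037, WRMR = .798.
print(fit_is_acceptable(cfi=0.976, tli=0.969, rmsea=0.037, wrmr=0.798))
```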


Results for Research Question 1: How Are ICT 
Literacy and CPS Related to Each Other? 


For each analysis used to address this research question, all 
structural models showed good model fit (i.e., CFI and TLI > .95; 
RMSEA < .06; SRMR < .05 or WRMR < .90). 

Study A. The latent correlation between the two measures of 
ICT literacy used in Study A (i.e., practical computer knowledge 
and computer anxiety) of r = -.73 (p < .01) was about the same 
size as the original r = -.59 reported by Richter et al. (2010). 
However, the latter correlation was on a manifest level and was 
thus uncorrected for measurement error, whereas the former was 
corrected for measurement error. Correlations between practical 
computer knowledge and both knowledge acquisition (r = .44, 
p < .01) and knowledge application (r = .36, p < .01) were 
moderate in size. Furthermore, correlations between computer 
anxiety and both knowledge acquisition (r = -.25, p < .01) and 
knowledge application (r = -.20, p < .05) were small in size but 
statistically significant. 
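
The larger absolute value of the latent correlation is what one would expect once measurement error is removed. As a rough analogue only (the latent correlation here comes from the structural model, not from this formula), the classical correction for attenuation divides the observed correlation by the geometric mean of the two reliabilities. The symbols are assumptions for illustration: r_XY is the observed correlation, and r_XX' and r_YY' are the reliabilities of the two measures.

```latex
\hat{\rho}_{XY} = \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}},
\qquad 0 < r_{XX'},\, r_{YY'} \le 1
\;\Longrightarrow\; |\hat{\rho}_{XY}| \ge |r_{XY}|
```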


Study B. The two measures of ICT literacy used in Study B 
(i.e., basic computer skills and computer anxiety) were moderately 
correlated (r = -.30, p < .01). Correlations between basic com- 
puter skills and both knowledge acquisition (r = .58, p < .01) and 
knowledge application (r = .63, p < .01) were large and signifi- 
cant. Smaller, but still significant, were the correlations between 
computer anxiety and both knowledge acquisition (r = -.22, p < 
.01) and knowledge application (r = -.31, p < .01). Although the 
latter were slightly higher than in Study A, the relations between 
computer anxiety and CPS were similar between the two studies. 

Study C. There were small but significant correlations be- 
tween computer anxiety and both knowledge acquisition 
(r = -.17, p < .01) and knowledge application (r = -.21, p < 
.01). Again, the sizes of the coefficients were similar to the other 
studies. 

In conclusion, in all three studies, there were significant rela- 
tions between different operationalizations of ICT literacy and 
CPS. In detail, the relations between both knowledge-based and 
behavioral measures of ICT literacy (i.e., practical computer 
knowledge and basic computer skills) and CPS were higher than 
the relation between attitudinal measures of ICT literacy (i.e., 
computer anxiety) and CPS. The expected modest correlations 
between the CPS components and operationalizations of ICT lit- 
eracy indicate that the two constructs are separable. 


Results for Research Question 2: Does ICT Literacy 
More Strongly Predict a CBA of CPS Than ICT 
Literacy Predicts the Assessment of General 
Cognitive Ability? 


Study A. The model with practical computer knowledge and 
computer anxiety as simultaneous predictors of CPS and reasoning 
showed a good overall model fit (Model A.1 in Table 1; see Figure 
2 for an illustration of the type of model evaluated in Research 
Question 2). As expected, practical computer knowledge predicted 
knowledge acquisition (β = .59, p < .01), knowledge application 
(β = .47, p < .01), and reasoning (β = .55, p < .01). However, 
computer anxiety was not a significant predictor at all in this 
model (knowledge acquisition: β = -.23; knowledge application: 
β = -.20; reasoning: β = -.19; all ps > .05; see Figure 2) even 
though in the bivariate analysis used to address Research Question 
1, computer anxiety was correlated with CPS. The high correlation 





Table 1 
Fit Indices for Different Models for Research Question 2 

Model                                 χ²           df   p            CFI    TLI    RMSEA   WRMR
Model A.1                             116.43       94   .06          .976   .969   .037    .798
Model A.2: constrained for PRACOWI    117.43       96   .07          .977   .971   .035    .800
Model A.3: constrained for COMA       [illegible]  96   .07          .977   .972   .035    [illegible]
Model B.1                             90.08        80   [illegible]  .987   .983   .024    .908
Model B.2: constrained for BCS        91.39        82   .22          .988   .984   .023    .907
Model B.3: constrained for COMA       93.49        82   .18          .985   .981   .026    .962
Model C.1                             69.55        59   .16          .992   .989   .023    [illegible]
Model C.2: constrained for COMA       77.91        61   .07          .987   .984   .029    [illegible]

Note. χ² and df estimates are based on mean- and variance-adjusted maximum likelihood. CFI = comparative fit index; TLI = Tucker-Lewis Index; 
RMSEA = root-mean-square error of approximation; WRMR = weighted root-mean-square residual; PRACOWI = Practical Computer Knowledge; 
COMA = Computer Anxiety; BCS = Basic Computer Skills. 

Figure 2. Model A.1 for Research Question 2. Reasoning, knowledge 
acquisition, and knowledge application were regressed on practical com- 
puter knowledge (PRACOWI) and computer anxiety (COMA). Parcels are 
not depicted. Standard errors are in parentheses. ** p < .001. 


between practical computer knowledge and computer anxiety (see 
Study A with regard to Research Question 1) as well as the lower 
power of this analysis due to the increased number of variables in 
the model in Figure 2 were possible explanations for the nonsig- 
nificant prediction of computer anxiety. 

If the paths from practical computer knowledge to CPS and 
reasoning were constrained to equality (Model A.2), model fit did 
not decrease significantly, χ²(2) = 1.338, p > .05. Constraining 
the paths from computer anxiety to CPS and reasoning to equality 
did not significantly decrease model fit either (Model A.3), 
χ²(2) = 0.670, p > .05. Thus, CPS was not more strongly 
predicted by ICT literacy than was general cognitive ability. 

Study B. The model in which basic computer skills and com- 
puter anxiety simultaneously predicted CPS and reasoning showed 
a good overall model fit (Model B.1 in Table 1). Basic computer 
skills predicted knowledge acquisition (β = .44, p < .01), knowl- 
edge application (β = .52, p < .01), and reasoning (β = .47, p < 
.01). Computer anxiety was a significant predictor of knowledge 
acquisition (β = -.14, p < .05) and knowledge application 
(β = -.23, p < .05), but not of reasoning (β = -.04, p > .05). 
If the paths from basic computer skills to CPS and reasoning were 
constrained to equality, the model did not fit significantly worse 
(Model B.2), χ²(2) = 1.616, p > .05. Constraining the paths from 
computer anxiety also did not significantly decrease the model fit 
(Model B.3), χ²(2) = 4.177, p > .05.¹ In sum, CPS was not more 
strongly predicted by ICT literacy than was general cognitive 
ability. 

Study C. The model with computer anxiety as a predictor of 
CPS and working memory capacity showed a good overall model 
fit (Model C.1 in Table 1). Computer anxiety predicted knowledge 
acquisition (β = -.18, p < .01) and knowledge application 
(β = -.24, p < .01) but not working memory capacity (β = -.01, 
p > .05). If the paths from computer anxiety to CPS and working 
memory capacity were constrained to equality, the model fit de- 
creased significantly (Model C.2), χ²(2) = 9.027, p < .05. Results 
indicated that CPS was more strongly predicted by ICT literacy 
than was general cognitive ability. 

In summary, the findings of two out of the three studies dem- 
onstrated that CPS was not more strongly predicted by ICT literacy 
than was general cognitive ability. In detail, both behavioral and 
attitudinal operationalizations of ICT literacy impacted CPS in a 
manner that was similar to their impact on different assessments of 
figural reasoning. However, CPS was more strongly predicted by 
attitudinal measures of ICT literacy (i.e., computer anxiety) than 
was working memory capacity. 


Results for Research Question 3: Can the Added 
Value of CPS Be Explained by ICT Literacy or Are 
Distinct Cognitive Processes in CPS Responsible for 
the Added Value? 


Study A. In these analyses, we used GPA as a manifest 
variable. The first model with reasoning, practical computer 
knowledge, and computer anxiety as predictors of GPA (manifest) 
showed a good model fit (Model A.1 in Table 2). However, only 
reasoning (β = .39, p < .01) and computer anxiety (β = -.25, 
p < .05) significantly predicted GPA but practical computer 
knowledge did not (β = .05, p > .05). Altogether, about 16% of 
the variance in GPA was explained in this model. In a second 
model (A.2), the residuals of CPS after controlling for reasoning 
and ICT literacy were added simultaneously to the predictors that 
were already included in the first model (see Figure 3 for an 
illustration of the type of model evaluated in Research Question 3). 
Again, GPA was significantly predicted by reasoning (β = .40, 
p < .01) and computer anxiety (β = -.25, p < .05) but not by 
practical computer knowledge (β = .06, p > .05). Furthermore, 
the residuals of CPS after controlling for reasoning, computer 
anxiety, and practical computer knowledge predicted GPA beyond 
general cognitive ability and ICT literacy (residuals of knowledge 
acquisition: β = .24, p < .05; residuals of knowledge application: 
β = -.09, p > .05; see Figure 3). In the second model, 21% of the 
variance was explained by CPS, indicating that 5% of the variance 
was additionally explained in comparison to the first model. 

Study B. The measurement model for school marks as the 
criterion in these analyses with the two dimensions natural sci- 
ences and social sciences along with languages showed a good 
model fit, χ²(26) = 24.03, p > .05; CFI = 1.000; TLI = 1.000; 
RMSEA = .000; WRMR = .528. 

Beginning with a model in which final school marks in the 
natural sciences were significantly predicted by reasoning (β = 
.48, p < .01) but not by basic computer skills (β = -.04, p > .05) 


¹ The maximum difference in path size was between knowledge appli- 
cation in CPS and reasoning. Therefore, we tested another more conser- 
vative model. However, the results for this more conservative model did 
not change if just the paths from computer anxiety to knowledge applica- 
tion and reasoning were constrained to equality, whereas the path to 
knowledge acquisition was freely estimated, χ²(1) = 3.252, p > .05. 

Table 2 
Fit Indices for Different Models for Research Question 3 

Model                                                    χ²       df           p            CFI    TLI          RMSEA   WRMR
Model A.1: g, PRACOWI, and COMA predict GPA              49.714   39           [illegible]  .990   .987         .035    .035ᵃ
Model A.2: g, PRACOWI, COMA, and CPS predict GPA         134.39   105          .03          .983   [illegible]  .036    .041ᵃ
Model B.1: g, BCS, and COMA predict school marks         142.33   125          .14          .984   .980         .020    [illegible]
Model B.2: g, BCS, COMA, and CPS predict school marks    260.09   233          [illegible]  .980   [illegible]  .018    [illegible]
Model C.1: MUN and COMA predict school marks             94.33    [illegible]  .03          .987   .984         .029    .697
Model C.2: MUN, COMA, and CPS predict school marks       193.98   [illegible]  .02          .984   .980         .025    .695

Note. χ² and df estimates are based on maximum likelihood (ML; Models A.1 and A.2) and weighted least squares mean- and variance-adjusted estimator 
(Models B.1 to C.2 because of ordered categorical school marks), respectively. CFI = comparative fit index; TLI = Tucker-Lewis Index; RMSEA = 
root-mean-square error of approximation; WRMR = weighted root-mean-square residual; g = Reasoning; PRACOWI = Practical Computer Knowledge; 
COMA = Computer Anxiety; GPA = grade-point average; CPS = complex problem solving; BCS = Basic Computer Skills; MUN = memory updating 
numerical. 
ᵃ Standardized root-mean-square residual because of ML estimator for Models A.1 and A.2. 


or by computer anxiety (β = -.01, p > .05), and school marks in 
the social sciences and languages were predicted only by computer 
anxiety (β = -.20, p < .05) but not by reasoning (β = .17, p > 
.05) or by basic computer skills (β = .14, p > .05), we found 21% 
explained variance in school marks in the natural sciences and 
8% explained variance in school marks in the social sciences 
and languages. The overall model fit was good (see Model B.1 
in Table 2). In a second model with good overall model fit 
(Model B.2 in Table 2) and with the CPS residuals as an 
additional predictor, there was no substantial change in the 
pattern of results for reasoning (natural sciences: β = .48, p < 
.01; social sciences and languages: β = .15, p > .05), basic 
computer skills (natural sciences: β = -.04, p > .05; social 
sciences and languages: β = .15, p > .05), and computer 
anxiety (natural sciences: β = .00, p > .05; social sciences and 
languages: β = .20, p < .05). Additionally, the residuals of 
knowledge application significantly predicted school marks in 
the natural sciences (β = .18, p < .05), but no variance in 
school marks in the social sciences and languages was incre- 
mentally predicted by CPS. Overall, 24% of the variance in the 
natural sciences and 8% of the variance in the social sciences 
and languages were explained by the second model, indicating 
that 3% of the variance in the natural sciences and 0% in the 
social sciences and languages were additionally explained when 
CPS was included as a third predictor. 

Study C. The measurement model for school marks as the 
criterion in these analyses with the two dimensions natural sci- 
ences and social sciences showed a good model fit, χ² = 
19.56, p = .05; CFI = .986; TLI = .973; RMSEA = .036; 
WRMR = .637. 

Figure 3. Model A.2 for Research Question 3. Knowledge acquisition and knowledge application were 
regressed on reasoning, practical computer knowledge (PRACOWI), and computer anxiety (COMA). The 
complex problem-solving residuals (RES) of this regression as well as reasoning, practical computer knowl- 
edge, and computer anxiety were used to predict grade-point average (GPA). Parcels are not depicted. Standard 
errors are in parentheses. * p < .05. ** p < .001. 

The first model with working memory capacity and computer 
anxiety as predictors of annual school marks showed a good model 
fit (Model C.1 in Table 2). Working memory capacity significantly 
predicted natural science school marks (β = .26, p < .01), but 
computer anxiety did not (β = -.02, p > .05). Similar results held 
for the social sciences. There, working memory capacity was a 
significant predictor (β = .11, p < .05), but computer anxiety was 
not (β = .00, p > .05). For school marks in the natural sciences, 
7% of the variance was explained, and for social science school 
marks, 1% was explained. If CPS residuals were included in the 
second model (Model C.2) as additional predictors, only the re- 
siduals of knowledge acquisition significantly predicted marks in 
the natural sciences (β = .25, p < .01) and social sciences (β = 
.26, p < .01). There was no change in the pattern of results for 
working memory capacity (natural sciences: β = .26, p < .01; 
social sciences and languages: β = .11, p < .05) and computer 
anxiety (natural sciences: β = -.02, p > .05; social sciences and 
languages: β = .00, p > .05). In the second model, 18% of the 
variance in natural science school marks and 7% of the variance in 
social science school marks were explained, indicating that 11% 
and 6%, respectively, were additionally explained by including the 
CPS residuals in comparison to the first model. 

Overall, all three studies demonstrated an added value of CPS 
beyond different operationalizations of general cognitive ability 
and ICT literacy. In detail, CPS additionally explained up to 11% 
of the variance in academic achievement. As indicated by the 
findings of Studies B and C, CPS was a stronger predictor of 
academic achievement in the natural sciences than in the social 
sciences and languages. 
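
The increments summarized above follow from simple subtraction of the explained variance with and without the CPS residuals. The short sketch below recomputes them from the percentages reported in the text; the dictionary layout and the function name `incremental_variance` are illustrative choices, not part of the original analyses.

```python
# Explained variance in percent as reported in the text:
# (model without CPS, model with CPS residuals).
reported_r2 = {
    ("Study A", "GPA"): (16, 21),
    ("Study B", "natural sciences"): (21, 24),
    ("Study B", "social sciences and languages"): (8, 8),
    ("Study C", "natural sciences"): (7, 18),
    ("Study C", "social sciences"): (1, 7),
}


def incremental_variance(r2_without, r2_with):
    """Return the additional percentage of variance explained by CPS."""
    return r2_with - r2_without


gains = {key: incremental_variance(*pair) for key, pair in reported_r2.items()}
largest = max(gains.values())

for (study, criterion), gain in gains.items():
    print(f"{study}, {criterion}: +{gain} percentage points")
print(f"Largest increment: {largest} percentage points")
```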


Discussion 


The aim of the current study was to deepen the understanding of 
how individual CPS skills, which are currently receiving consid- 
erable interest in educational contexts as a highly relevant trans- 
versal skill (Mayer & Wittrock, 2006), are influenced by ICT 
literacy. To cover the construct comprehensively, this question 
was pursued in three different samples with a number of different 
measures of ICT literacy. Furthermore, we controlled for different 
general cognitive abilities such as reasoning and working memory 
when relating CPS and ICT literacy. In general, the results of our 
study supported the assumption that an assessment of CPS allows 
complex cognitive processes to be captured. These processes are 
related to ICT literacy and general cognitive ability to a certain 
extent but not exclusively so. More specifically, CPS skills were 
weakly to moderately related to ICT literacy (Research Question 
1). However, the relations between CPS and different assessments 
of ICT literacy were (with one exception) just as strong as between 
general cognitive ability and ICT literacy (Research Question 2). 
Most importantly, we were able to determine that the added value 
of CPS recently reported in the literature is not attributable to an 
indirect assessment of ICT literacy in CPS measures. That is, ICT 
literacy assessment and CPS assessment were not confounded in 
such a way that the validity of the latter was threatened. In fact, 
when controlling for general cognitive ability and ICT literacy, the 
incremental validity of CPS in predicting relevant external criteria 
remained substantial (Research Question 3). 

In accordance with previous research (Hartig & Klieme, 2005; 
Süß, 1996), we found a noteworthy relation between CPS and ICT 
literacy. We can therefore repeat Mead and Drasgow’s (1993) 
word of caution that the influence of ICT literacy in CBA should 
not be underestimated and that any computer-delivered measure- 
ment instrument should be carefully designed and examined. This 
may be even more important for assessment instruments that 
reflect rather complex skills such as CPS or serious games (cf. 
Michael & Chen, 2006; Russell et al., 2003) that require a some- 
what more complex graphical user interface. However, in contrast 
to recently reported results by Sonnleitner et al. (2013), the added 
value of CPS independent of ICT literacy was demonstrated con- 
sistently in three studies. Sonnleitner et al. (2013) reported an 
added value of CPS only if the assessment of academic achieve- 
ment as a criterion was computer based, an idea that indirectly 
suggests a strong effect of ICT literacy. A reason for the discrep- 
ancy between studies could lie in the different operationalizations 
of CPS. GeneticsLab, the CPS assessment used by Sonnleitner et 
al. (2013), requires more advanced human–computer interactions 
than MicroDYN concerning the documentation of acquired knowl- 
edge and thus, arguably, an even higher level of ICT literacy. In 
MicroDYN, students are asked simply to draw arrows between 
variables on a concept map displayed at the bottom of the screen 
(see Figure 1), whereas in the GeneticsLab, the concept map is 
presented in a separate display, and students are asked to draw the 
relations and to label the strengths of the relations between the 
variables on a more comprehensive and also more complicated 
concept map. As a consequence, the GeneticsLab has a longer 
instruction phase, several user-interface environments for different 
CPS dimensions, more differentiated knowledge inquiry, and so 
forth. These features that are characteristic of CPS may increase 
the validity of the CPS assessment but at the same time may also 
increase the potential impact of ICT literacy and, thus, the predic- 
tion of computer-based external criteria as reported by Sonnleitner 
et al. (2013). Taking into account the different results concerning 
the influence of ICT literacy on CPS, we conclude that the impact 
of ICT literacy depends on the operationalization of CPS and the 
specific implementation of the assessment. For MicroDYN as a 
CPS assessment tool, ICT literacy was no threat to its validity. 
However, this article’s purpose, which was to examine the impact 
of ICT literacy, should be considered for every new operational- 
ization of CPS. 

This study was driven by two mutually exclusive explanations 
for the added value of CPS. It was argued that either (a) CPS 
assessment captures unique characteristics and complex cognitive 
processes that are not inherent to general cognitive ability or (b) 
the assessment of CPS is a confounded assessment of general 
cognitive ability and ICT literacy. Our findings did not support the 
second explanation. With regard to the incremental validity of CPS 
in particular, we came to the conclusion that the assessment of CPS 
allows researchers to consider complex cognitive processes be- 
yond general cognitive ability (cf. Raven, 2000), indicated by the 
finding that up to 11% of the variance in academic achievement 
was additionally explained by CPS beyond the variance explained 
by general cognitive ability and ICT literacy even though the 
prediction of social science and language grades was considerably 
lower. However, the overall result pattern provides important 
evidence for the validity of CPS in line with recent research (Greiff 
et al., 2012; Greiff, Wüstenberg, et al., 2013; Wüstenberg et al., 
2012). 
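
The logic of incremental validity discussed above can be illustrated with a minimal sketch. It computes R² from a correlation matrix, first for a baseline model (general cognitive ability and ICT literacy) and then for a full model that adds CPS; the difference is the variance additionally explained. All correlations below are hypothetical values chosen for illustration, not results from the three studies, and the studies themselves may have relied on latent variable models rather than this observed-score calculation.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(r_xx, r_xy):
    """R^2 from predictor intercorrelations (r_xx) and predictor-criterion correlations (r_xy)."""
    beta = solve(r_xx, r_xy)  # standardized regression weights
    return sum(b * r for b, r in zip(beta, r_xy))


def incremental_r2(r_xx, r_xy, n_baseline):
    """Variance explained by the remaining predictors beyond the first n_baseline ones."""
    base = r_squared([row[:n_baseline] for row in r_xx[:n_baseline]], r_xy[:n_baseline])
    return r_squared(r_xx, r_xy) - base


# Hypothetical correlations. Predictor order: cognitive ability, ICT literacy, CPS.
r_xx = [[1.0, 0.2, 0.5],
        [0.2, 1.0, 0.2],
        [0.5, 0.2, 1.0]]
r_xy = [0.40, 0.10, 0.45]  # correlations with academic achievement (hypothetical)

delta_r2 = incremental_r2(r_xx, r_xy, n_baseline=2)
print(f"Variance additionally explained by CPS: {delta_r2:.3f}")
```

A positive difference indicates that CPS explains criterion variance that the baseline predictors do not; whether that difference is meaningful still depends on sample size and on the measurement model.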
Furthermore, we discuss two details of our findings more specifically. 
First, with regard to the different operationalizations of 
general cognitive ability, we found a stronger influence of the 
attitudinal component of ICT literacy on CPS than on working 
memory. In fact, the MUN tasks that were used to assess working 
memory in Study C were not related to computer anxiety at all. 
Working memory is supposed to be a general, albeit basic, cognitive 
ability (cf. Oberauer, Süß, Wilhelm, & Wittmann, 2008), and, 
thus, its assessment requires just simple human-computer interactions. 
In contrast to MicroDYN, for which students are asked to 
use several input and display elements (e.g., sliders, concept maps, 
diagrams) in different user interfaces, the only computer interac- 
tion in the MUN task is indeed just to successively press a number 
key on the keyboard to interact within a simple and uniform 
graphical user interface. Thus, we argue that a less complex 
human-computer interaction will be less influenced by ICT literacy 
(i.e., at least by the attitudinal component of ICT literacy). 
Therefore, our findings are rendered even more powerful because, 
despite increased interactions within the user interface of our CPS 
assessment (cf. Figure 1), the predictive validity of CPS was not 
reduced in any of the three studies. 

The second detailed result that is worth mentioning is the 
differential predictive power of the two CPS dimensions: knowl- 
edge acquisition and knowledge application. In previous research 
(e.g., Schweizer et al., 2013; Wüstenberg et al., 2012), knowledge 
acquisition was the stronger predictor of academic achievement. 
However, our findings from Study B with regard to Research 
Question 3 showed a different pattern: Knowledge application was 
the strongest predictor rather than knowledge acquisition. There 
may be two different explanations for this result. First, the two 
dimensions are empirically separable but are highly correlated as 
found in previous studies (latent correlation around .70–.80). As a 
consequence, the differences in the relations of the two dimensions 
to external criteria could be due to a random capitalization on 
chance with knowledge acquisition being more strongly related to 
external criteria in some samples and knowledge application in 
others. Thus, a replication of the finding in Study B should be the 
next step taken to gain further insights into this issue. The second 
explanation addresses the demands that the CPS assessment places 
on students. As mentioned above, the difficulty of the CPS assess- 
ment was adjusted with regard to differences in the cognitive 
potential of the samples. In Study B, the study with the cognitively 
most able sample, the participants’ goal values when their knowledge 
application was assessed were more complex and interactive 
than in the two other studies. For example, to reach the 
given target values in knowledge application, more simultaneous 
inputs were necessary, and multiple targets had to be considered at 
the same time. Thus, in Study B, we might have captured complex 
cognitive processes by placing additional demands on the assess- 
ment of knowledge application. These additional demands may 
require processes beyond the complex processes that were already 
assessed by the knowledge acquisition dimension. Thus, the pre- 
dictive power of knowledge application surpassed that of the 
knowledge acquisition dimension. In general, we interpret this 
finding as an additional potential of the knowledge application 
dimension, and this has not yet been examined systematically. In 
conclusion, to increase knowledge about CPS, further research 
concerning the differential importance of knowledge acquisition 
and knowledge application is needed to better understand the 
differential predictive power of the two CPS dimensions (cf. 
Sonnleitner, Brunner, Keller, & Martin, 2014). 

Finally, some limitations of this article and outlooks for future 
research should be discussed. As noted above, we used broad 
operationalizations of ICT literacy and general cognitive ability; 
thus, the generalizability of our findings was not limited to single- 
assessment instruments of ICT literacy and general cognitive abil- 
ity. Specifically, reasoning and working memory are good indica- 
tors of general cognitive ability. However, they do not cover the 
entire range of general cognitive ability, which additionally en- 
compasses attention, long-term memory, processing speed, percep- 
tion, verbal ability, crystallized intelligence, and so forth. Further- 
more, ICT literacy is composed of a variety of aspects. Not all of 
them were covered in our study. However, our results provided 
indications of differential effects of ICT literacy on cognitive 
abilities. For example, computer anxiety as part of ICT literacy had 
an influence on CPS but not on working memory capacity. There- 
fore, future research should explore other cognitive abilities and 
more diverse operationalizations of ICT literacy when relating 
them to CPS (cf. Wittmann & Süß, 1999). Additionally, only 
MicroDYN was used as an assessment of CPS. To expand the 
nomological network, several operationalizations are necessary. 
Regarding this, the approaches of finite state automata (Buchner & 
Funke, 1993) or classical microworlds (Funke, 2001) would be 
worthwhile extensions to MicroDYN. To this end, not only on the 
level of operationalizations but also with regard to theoretical 
considerations, additional efforts are needed to more comprehen- 
sively understand CPS in the context of cognitive theories such as 
the theory of situated cognition (Brown, Collins, & Duguid, 1989) 
or CHC theory (McGrew, 2009). In conclusion, further operationalizations 
of general cognitive ability and CPS should be considered 
and theoretical considerations should be made in future research. 

We put forward two explanations for the added value of CPS: 
the additional assessment of complex cognitive processes beyond 
general cognitive ability and a confounded assessment of general 
cognitive ability and ICT literacy. However, one could invoke 
other factors that may account for the added value of CPS, for 
example, motivation. The issue of motivation and acceptance has 
been discussed since the beginning of CPS assessment (cf. Kerst- 
ing, 1998; Sonnleitner et al., 2012; Vollmeyer & Funke, 1999). 
Gamelike features and attractive graphical setups may enhance 
motivation, which in turn may explain the added value of CPS. In 
our studies, we used attitudinal measures of ICT literacy, but we 
did not include motivational aspects in our research. If we want to 
be sure that the added value of CPS is mainly based on complex 
cognitive processes, it will be necessary to extend our research 
strategy to comprehensively consider other possible explanations 
such as motivation as well. 

We used three different samples for this research: high-ability 
university students, university students with a broad study back- 
ground, and high school students. Although these samples covered 
three relevant groups from the population of students, they are not 
completely representative of the entire population. With regard to 
generalizability (Brennan, 1983), we note that our results may be 
biased by low variability and may be different in other subgroups 
(e.g., in adults). In this sense, today’s students are described as 
“digital natives” (Prensky, 2001), indicating a generally high level 
of ICT literacy and, thus, restricted variance. However, the impact 
of ICT literacy may not be linear across the entire range of ability. 
On the one hand, poor ICT literacy (i.e., not being familiar with the 
operations necessary to handle the computer) may cause severe 
difficulties, but on the other hand, a high level of ICT literacy may 
not be helpful in solving CPS tasks. It is important to consider such 
a nonlinear relation (cf. Leutner, 2002) for a deeper understanding 
of the differentiated impact of ICT literacy on CPS. However, such 
analyses need more heterogeneous samples, which should be con- 
sidered in future research. 
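
The interplay of a nonlinear relation and restricted variance can be made concrete with a small simulation. The sketch below assumes a purely hypothetical threshold model in which ICT literacy helps CPS performance only up to a basic level; the threshold, the scales, and the noise level are illustrative assumptions and are not estimates from our data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible
THRESHOLD = 3.0         # hypothetical level beyond which ICT literacy no longer helps

# Heterogeneous sample: ICT literacy spread over the full (hypothetical) 0-10 range.
ict = [rng.uniform(0.0, 10.0) for _ in range(5000)]
# CPS performance rises with ICT literacy only up to the threshold, plus random noise.
cps = [min(x, THRESHOLD) + rng.gauss(0.0, 1.0) for x in ict]

r_full = pearson(ict, cps)

# Homogeneous "digital natives" subsample: only high ICT literacy.
high = [(x, y) for x, y in zip(ict, cps) if x >= 5.0]
r_high = pearson([x for x, _ in high], [y for _, y in high])

print(f"full range: r = {r_full:.2f}; high-literacy subsample: r = {r_high:.2f}")
```

Under these assumptions, the relation is clearly visible in the heterogeneous sample but close to zero in the high-literacy subsample, which is why samples with a broader range of ICT literacy are needed to detect such a pattern.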

In conclusion, our aim was to extend the understanding of the 
assessment of CPS and, thus, to make a contribution to the vali- 
dation of the CBA of transversal skills. CPS as one of the most 
promising of these skills is an important part of international 
educational large-scale assessments such as PISA. The results of 
these large-scale assessments have extensive and substantial im- 
plications, for example, on further developments of educational 
systems. Thus, we believe that our research will lead to a better 
understanding of the results of educational large-scale assessments 
and, hopefully, to well-founded decisions that are aimed at bene- 
fitting students’ transversal skills in a quickly changing world. 

Educational systems currently face two major challenges. First, 
pressure has risen to include and assess cross-curricular compe- 
tencies within new educational curricula (Elliot Bennett, Jenkins, 
Persky, & Weiss, 2003; Kuhn, 2009; Ridgway & McCusker, 
2003). The computer-based assessment of students’ complex 
problem-solving skill (CPS) has been suggested as a possible route 
to addressing this alluring but diffuse set of abilities (Greiff, 
Kretzschmar, Müller, Spinath, & Martin, 2014; Greiff et al., 2013). 
CPS describes the competency to adequately interact with domain-general 
problems in order to gather and successfully apply knowledge 
to reach certain target states (e.g., Buchner, 1995). This 
applied, domain-general character implies that cognitive processes 
related to CPS can be used in very different content areas and thus 
makes CPS a central example of cross-curricular competencies. 
Initial results concerning a psychometrically sound and reliable 
assessment of CPS within the educational context have been 
promising (Greiff et al., 2013; Sonnleitner et al., 2012). A huge 
step in this direction has also been taken through the inclusion of 
CPS measures in one of the most significant international large- 
scale assessments, the Program for International Student Assess- 
ment (PISA; Leutner, Fleischer, Wirth, Greiff, & Funke, 2012; 
OECD, 2010). However, since the development of CPS assess- 
ment instruments relies on fairly recent advances in computer- 
based assessment, research on this topic is still relatively scarce. 
A second challenge facing educational systems today is finding 
appropriate ways to respond to the specific sociocultural and 
socioeconomic needs of increasing numbers of students with im- 
migration background (Meunier, 2011; OECD, 2012). The Organi- 
sation for Economic Co-operation and Development (OECD) has 
argued that only when a country’s educational system succeeds in 
adequately integrating immigrant students can these students fully 
develop their potential to participate in a society's social and 
economic life (OECD, 2012). Yet most immigrant students lag 
behind their nonimmigrant peers in various academic subjects 
(e.g., mathematics; OECD, 2012; Schleicher, 2006). A possible 
reason can be seen in the specific challenges they face, such as 
being educated in a language different from their mother tongue 
(OECD, 2012). 

Consequently, if the teaching and assessment of cross-curricular 
competencies such as CPS is to be established as an official part of 
school curricula, the consideration of immigrant students is vital. 
So far, nothing is known about whether CPS measures are fair with 
respect to immigration background or whether performance dif- 
ferences exist between immigrant students and their native peers. 

The present study attempts to fill this gap by thoroughly explor- 
ing whether immigrant students are disadvantaged on measures of 
CPS or not. Specifically, we investigated whether an established 
measure of CPS (i.e., the Genetics Lab; Sonnleitner et al., 2012) is 
measurement invariant and thus fair with respect to immigration 
background. Only measurement invariance ensures (a) that the test 
works equally for students with and without immigration background, 
(b) that the same construct is measured in both groups, and 
(c) that any performance differences between these groups are due 
to actual differences in the construct being measured, and not to 
bias or error produced, for instance, by construct-irrelevant cul- 
tural differences (Little, 1997; Widaman & Reise, 1997). In a 
consecutive step, we attempted to determine whether and to what 
extent performance differences in individual facets of CPS exist 
for students of varying immigration backgrounds. 

To this end, we drew on a sample of ninth grade students of 
differing immigration backgrounds enrolled in different academic 
tracks. The study took place in Luxembourg, a country known for 
its high ratio of immigrant students (Burton & Martin, 2008; 
OECD, 2012). This heterogeneous sample may make results par- 
ticularly relevant on an international level, since high immigrant 
rates can be found in countries around the globe (OECD, 2012). 

In sum, the present study addresses the gap in our understanding 
of immigrant students’ CPS performance by investigating (a) 
fairness of CPS assessment with regard to immigration back- 
ground and (b) performance differences between immigrant and 
native students in individual facets of CPS. 


Complex Problem Solving: Computer-Based 
Assessment and Performance Scores 


Students’ skill to solve complex problems is typically assessed 
by computer-based microworlds, such as the Genetics Lab (Sonn- 
leitner et al., 2012; Sonnleitner, Keller, Martin, & Brunner, 2013) 
shown in Figure 1. Crucially, such microworlds (a) incorporate 
several characteristics that also describe problems of high com- 
plexity in everyday life and (b) require the students to acquire 
knowledge about the problem in order to purposefully apply it to 
reach certain target states of the problem (cf. Funke, 2010). 

For better illustration, the Genetics Lab is depicted in Figure 1. 
In the first, knowledge acquisition phase (Figure 1a), students are 
asked to imagine that they are researchers in a genetics lab where 
they can manipulate several genes of fictitious creatures in order to 
study how these genes are related to several characteristics of the 
creatures. Some of these characteristics additionally change on 
their own, that is as a function of time. Students can depict the 
knowledge they have gathered about the relations between genes 
and creature characteristics in a creature-specific database (Figure 
1b) by means of a causal diagram. In the second, knowledge 
application phase (Figure 1c), students must apply the gathered 
knowledge in order to achieve several target states of the characteristics 
within a given number of manipulations. Note that the 
semantic embedding is entirely fictive, making only very low 
demands on previous knowledge (Greiff, Wüstenberg, & Funke, 
2012). 

Typically, the Genetics Lab provides three scores, which reflect 
different facets of students’ complex problem solving behavior. 
The first, rule identification, describes the quality and efficiency 
with which students explore the given problem. Some exploration 
strategies are more informative than others. For example, it is more 
informative to manipulate only one gene and then study this 
manipulation’s effects on the creature’s characteristics than to 
manipulate several genes at the same time. Simultaneously manip- 
ulating multiple genes means that their individual effects are 
intermingled and can no longer be unambiguously identified 
(Vollmeyer, Burns, & Holyoak, 1996). The second score reflects 
students’ skill to express their gathered rule knowledge (see footnote 1) within a 
causal diagram. Compared to other microworlds assessing CPS 
(e.g., MicroDYN; Greiff et al., 2012; Wüstenberg, Greiff, & 
Funke, 2012), the Genetics Lab allows for a more differentiated 
assessment of rule knowledge. Students not only have to show 
relational knowledge by indicating whether a causal relation exists 
between a given gene and a certain characteristic; they also have to 
demonstrate knowledge about the type (increasing or decreasing) 
and strength (weak or strong) of this relation (Blech & Funke, 
2005). The third score, rule application (see footnote 1), relates to 
the students’ skill to utilize the gathered knowledge in order to 
achieve the given target values on the creatures’ characteristics. 
Since the number of available manipulations is limited, students’ 
skill to plan, make forecasts, and react to unexpected consequences 
comes into play (Funke, 2003). 
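
Why manipulating one gene at a time is more informative can be shown with a minimal sketch. The toy system below is a hypothetical stand-in for a microworld, not the Genetics Lab itself: the gene names, effect sizes, and drift value are assumptions made purely for illustration.

```python
# Hypothetical hidden rules: each gene's effect on one characteristic per step.
EFFECTS = {"gene_a": 2.0, "gene_b": -1.0}
DRIFT = 0.5  # the characteristic also changes on its own as a function of time


def observed_change(inputs, effects=EFFECTS, drift=DRIFT):
    """Change in the characteristic after one step with the given gene settings."""
    return drift + sum(effects[gene] * level for gene, level in inputs.items())


# Step 0: manipulate nothing, which reveals the autonomous change (drift).
baseline = observed_change({"gene_a": 0, "gene_b": 0})

# Varying one gene at a time isolates each individual effect.
effect_a = observed_change({"gene_a": 1, "gene_b": 0}) - baseline
effect_b = observed_change({"gene_a": 0, "gene_b": 1}) - baseline

# Varying both genes at once yields only the sum of their effects.
joint = observed_change({"gene_a": 1, "gene_b": 1}) - baseline

# A different hidden rule set produces exactly the same joint observation,
# so the simultaneous manipulation cannot distinguish between the two.
other_rules = {"gene_a": 0.5, "gene_b": 0.5}
joint_other = observed_change({"gene_a": 1, "gene_b": 1}, effects=other_rules) - baseline

print(effect_a, effect_b, joint, joint_other)
```

In this sketch, the single-gene manipulations recover the two hidden effects, whereas the simultaneous manipulation is compatible with more than one set of rules.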

Currently, there is no consensus on how to best represent the 
various phases of the problem solving process psychometrically. 
Several researchers, for example, consider CPS to be a multidi- 
mensional construct. Thus, they distinguish facets corresponding 
to all or a subset of the phases and derive scores for rule identification, 
rule knowledge, and rule application. Interestingly, many 
studies in the educational domain that have drawn on student 
samples have only obtained scores for the facets of rule knowledge 
and rule application (Bühner, Kröner, & Ziegler, 2008; Greiff et 
al., 2013; Wüstenberg, Greiff, Molnár, & Funke, 2014). When the 
third facet of rule identification has been measured, its independence 
from the other two facets has been called into question. 
Whereas Kröner, Plass, and Leutner (2005), as well as Sonnleitner 
et al. (2013), could reliably measure and discriminate between all 
three facets of CPS, Schweizer, Wüstenberg, and Greiff (2013) 
found rule identification to be identical to rule knowledge and thus 
redundant. Irrespective of whether two or three facets of CPS were 
distinguished, all previous studies have shown that the facets of 
CPS are strongly interrelated. The better students explore 
a problem, the more knowledge they acquire and the better 
they reach given target values. Further, more acquired 
knowledge is linked to a better skill to reach given target values. 


1 Please note that in several studies that do not assess rule identification, 
the scores rule knowledge and rule application are labeled according to the 
related problem solving phases, knowledge acquisition and knowledge 
application. 
Correlations (adjusted for measurement error) between the facets 
have typically been well above .60 (Greiff et al., 2013; Kröner et 
al., 2005; Schweizer et al., 2013; Sonnleitner et al., 2013; Wüstenberg 
et al., 2012). 
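
What an adjustment for measurement error does to a correlation can be sketched with the classical correction for attenuation. This is an illustration only: the cited studies may instead have estimated latent correlations in structural equation models, and the observed correlation and reliabilities below are hypothetical.

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y):
    """Classical correction for attenuation: r / sqrt(rel_x * rel_y)."""
    if not (0.0 < reliability_x <= 1.0 and 0.0 < reliability_y <= 1.0):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Hypothetical example: observed correlation between two facet scores of .50,
# with score reliabilities of .80 and .70.
r_corrected = disattenuate(0.50, 0.80, 0.70)
print(f"corrected correlation: {r_corrected:.2f}")
```

Because unreliable scores attenuate observed correlations, the corrected value is never smaller in magnitude than the observed one.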

Whereas those researchers who consider CPS to be a multidi- 
mensional construct derive only facet-specific scores, other re- 
searchers are solely interested in a global CPS score (e.g., PISA 
2012; Abele et al., 2012; OECD, 2010). In particular, these latter 
researchers conceive of CPS as a hierarchical construct where a 
general CPS skill explains the intercorrelations among the facets 
(Sonnleitner et al., 2013). In this theoretical framework, a higher 
general CPS skill leads to better performance in all phases of the 
problem solving process. Since the question of which psychomet- 
ric conceptualization (multidimensional vs. hierarchical) repre- 
sents CPS best and provides the most value for applied settings is 
still open to debate (cf. Sonnleitner et al., 2013), the current study 
draws on both conceptualizations. 


Measurement Invariance and Group Differences 
in CPS 


The central prerequisite for a fair comparison of performance 
differences in groups is that the administered measurement instru- 
ment essentially measures the same construct in both groups and 
thus is measurement invariant or measurement equivalent (see 
Little, 1997, or Widaman & Reise, 1997). According to Little 
(1997, p. 56), 


Measurement equivalence (strong factorial invariance) indicates that 
(a) the constructs are generalizable to each sociocultural context, (b) 
sources of bias and error (e.g., cultural bias, translation errors, varying 
conditions of administration) are minimal, (c) cultural differences 
have not differentially affected the constructs underlying measure- 
ment characteristics [. . .], and (d) between-culture differences in the 
constructs’ mean, variance, and covariance relations are quantitative 
in nature (i.e., the nature of cultural differences can be assessed as 
mean-level, variance, and covariance or correlational effects). 


Consequently, a comparison of manifest test scores or latent means 
across groups is justified and fair only if strong factorial measure- 
ment invariance holds. Whether measurement invariance is tenable 
for a certain measure is usually explored in a stepwise procedure, 
increasingly constraining model parameters to be the same across 
groups and investigating whether this significantly impacts model 
fit (cf. Little, 1997; Widaman & Reise, 1997). 
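
The stepwise procedure can be sketched as a sequence of nested model comparisons. The example below assumes that each more constrained model is compared with its predecessor by a chi-square difference test; other criteria (e.g., changes in descriptive fit indices) are also used in practice. The level labels follow common usage, and the fit statistics are hypothetical values for illustration, not results of any study.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate (series for the incomplete gamma function)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term, total, n = 1.0, 1.0, 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - lower)


def highest_tenable_level(fits, alpha=0.05):
    """Return the most constrained model whose added constraints do not significantly worsen fit.

    fits: list of (label, chi_square, df), ordered from least to most constrained.
    The first (baseline) model is assumed to fit acceptably.
    """
    tenable = fits[0][0]
    for (_, chi2_prev, df_prev), (label, chi2_cur, df_cur) in zip(fits, fits[1:]):
        p_value = chi2_sf(chi2_cur - chi2_prev, df_cur - df_prev)
        if p_value < alpha:
            break  # the added equality constraints significantly worsen model fit
        tenable = label
    return tenable


# Hypothetical fit statistics for a two-group model.
fits = [
    ("configural", 100.0, 48),  # same factor structure in both groups
    ("weak", 106.0, 54),        # plus equal factor loadings
    ("strong", 112.0, 60),      # plus equal intercepts
    ("strict", 140.0, 69),      # plus equal residual variances
]
level = highest_tenable_level(fits)
print(level)
```

With these hypothetical values, the constraints up to strong factorial invariance do not significantly worsen fit, so a comparison of latent means across groups would be considered justified.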

The investigation of measurement invariance regarding CPS is 
still at its beginning. Importantly, no studies have yet investigated 
measurement invariance for all facets of CPS or general CPS for 
students of differing immigration backgrounds. Nevertheless, there 
are some studies that provide promising results. The assessment of 
rule knowledge and rule application has been found to be mea- 
surement invariant and thus fair across sex, nationalities, and 
grade-levels (Greiff et al., 2013; Wüstenberg et al., 2014). For the 
facet of rule identification, however, measurement invariance has 
yet to be established. 

Group comparisons (i.e., mean comparisons after measurement 
invariance has been established) of rule knowledge and rule ap- 
plication have shown a strong influence of educational background 
(Greiff et al., 2013; Wüstenberg et al., 2014). In general, higher 
grade level and highest attended educational level corresponded 
with better performance in rule knowledge and rule application. In 
a cross-cultural study reported by Wüstenberg et al. (2014), Hungarian 
high school students were slightly outperformed by their 
German counterparts in rule knowledge and rule application. 


Reasons for Expected Performance Differences in CPS 
Skill Due to Students’ Immigration Background 


Despite positive attitudes toward learning and school, immigrant 
students perform significantly worse than their nonimmigrant 
peers on mathematics, reading, science and (paper-pencil based 
measures of) problem solving skills as assessed in PISA 2003 
(Martin, Liem, Mok, & Xu, 2012; Schleicher, 2006). There are 
several additional reasons why performance differences in CPS 
skill might be expected between immigrant students and their 
nonimmigrant peers. First, especially for the facet of rule identi- 
fication, evidence suggests that culture-specific differences may 
influence performance. Two cross-cultural studies revealed that 
exploration strategies in complex problems strongly vary between 
countries (Güss, Tuason, & Gerhard, 2010; Strohschneider & 
Güss, 1999). A comparison of university students from Germany, 
Brazil, India, the Philippines, and the United States showed country-specific 
differences in performance and exploration behavior in a 
(computer-based) complex problem solving environment (Güss et 
al., 2010). Think-aloud techniques and qualitative analyses of 
verbal protocols were able to show that country-specific problem 
solving strategies led to differences in the amount of gathered 
information, the way problems were investigated, and the demon- 
strated planning and decision making behavior. Reasons for 
country-specific problem solving styles were seen in “environmental 
and culture-based differences such as context, resource availability, 
and emotional expressiveness” (Güss et al., 2010, p. 510). 
This is in line with the results reported by Wüstenberg et al. (2014) 
that cross-country differences between Germany and Hungary in 
the facets of rule knowledge and rule application were largely 
attributable to a poor exploration strategy among Hungarian female 
students, again pointing to the importance of the facet of rule 
identification. Note, however, that no results on the measurement 
invariance of rule identification were reported. 

Figure 1. Screenshots of the different demands students have to solve within the Genetics Lab, a microworld used to assess complex problem 
solving. A. Rule Identification: In a fictive genetics lab, students have to manipulate genes (depicted in the diagrams on the left) and identify their effects 
on a creature’s characteristics (depicted in the diagrams on the right). At any time, they can switch to the database to depict the gathered knowledge by 
clicking on the button in the upper left corner of the screen. B. Rule Knowledge: In a creature related database, students can depict their gathered knowledge 
by means of a causal diagram. Arrows pointing from genes to characteristics represent a causal effect and indicate the strength (weak or strong) and direction 
(increasing or decreasing) of this effect. At any time, students can click on the help button in the upper right corner of the screen. C. Rule Application: 
Students have to apply the gathered knowledge to achieve given target values on the creature’s characteristics (indicated by the horizontal lines). 
Importantly, they only have a limited number of manipulations to do this. Reprinted from “The Genetics Lab: Acceptance and Psychometric Characteristics 
of a Computer-Based Microworld Assessing Complex Problem Solving,” by P. Sonnleitner, M. Brunner, S. Greiff, J. Funke, U. Keller, R. Martin, et al., 
2012, Psychological Test and Assessment Modeling, 54, p. 59, Figure 1. Copyright 2012 by Pabst Science. 

Another reason why performance differences in CPS might be 
expected for students with immigration background lies in the 
importance of these skills for the special situation these students 
are in. According to Martin et al. (2012), problem solving skills are 
crucial for the academic development of immigrant students, as 
they might compensate for a lack of country-specific prior curricular 
knowledge or educational experience. In an excellent study draw- 
ing on the PISA 2003 data set, Martin et al. supported these claims 
by showing that nonimmigrant students generally outperformed 
their immigrant peers on (paper-pencil based) measures of prob- 
lem solving, mathematics, and science. Moreover, students’ prob- 
lem solving skills were found to strongly relate to students’ 
achievement in mathematics and science, underscoring the cross- 
curricular importance of problem solving skills for students with 
immigration background. Similar to academic subjects, the micro- 
worlds used to assess CPS are not entirely context-free. Due to the 
semantic embedding of most microworlds, immigrant students 
may be disadvantaged if they lack culture-specific knowledge that 
is (unintentionally) tapped by the administered microworlds. Even 
an abstract representation of variables or the use of causal dia- 
grams makes strong demands on culture-specific knowledge that 
might impair immigrant students’ performance in these micro- 
worlds (see also van de Vijver, 2008, who discussed cross-cultural 
differences of test understanding and previous test exposure even 
for reaction-time tests). 

A third reason for expected performance differences is based on 
several studies that have shown that language proficiency in the 
instructional language predicts performance even in largely 
“language-independent” subjects, such as mathematics (Kempert, 
Saalbach, & Hardy, 2011; Levin & Shohamy, 2008). Moreover, 
when the language in which knowledge acquisition took place 
differs from the language used for knowledge application, “cog- 
nitive costs” arise, impairing students’ performance (Kempert et 
al., 2011; Saalbach, Eckstein, Andri, Hobi, & Grabner, 2013). As 
microworlds are novel and complex tasks, they usually come with 
written instructions of varying length (Rollett, 2008). Thus, lan- 
guage proficiency plays an important role in these tasks, even more 
so when students whose mother tongue differs from the test 
language are forced to switch language because the language (or 
“self-talk,” i.e., their mother tongue) in which they explore and 
investigate a problem (see Güss et al., 2010) differs from the 
language in which the complex problem is presented. 


Possible Benefits of Computer-Based Microworlds for 
Immigrant Students 


In contrast to the factors that might impair immigrant students’ 
performance in CPS tasks, computer-based assessment offers sev- 
eral possibilities that might counter some of these effects or even 
answer special needs of immigrant students in a way that would 
not be possible in traditional paper-pencil based assessment instru- 
ments. First, microworlds might offer the opportunity to identify 
immigrant students’ cognitive potential despite their educational 
background. Conventional (paper-pencil based) measures of higher order 
cognitive skills, such as reasoning tests, have been found to be posi- 
tively influenced by attending the academic (i.e., a higher) school 
track (Becker, Lüdtke, Trautwein, Köller, & Baumert, 2012; 
Gustafsson, 2008). In such tasks, all the information that is needed 
to deduce the correct solution is given at the outset and does not 
need to be generated through interactions with the problem (Rol- 
lett, 2008). In contrast, microworlds demand that the students 
interact with a problem and apply adequate exploration strategies 
to generate and gather information. Crucially, 81% of Luxembourg 
students report that they never or hardly ever spend time in 
laboratories or have to think of appropriate ways to solve science 
problems by conducting experiments (MENFP, SCRIPT, Univer- 
sité du Luxembourg, & EMACS, 2007). Thus, microworlds might 
be less affected by educational background than traditional tests, as 
their task demands are fairly novel and rarely trained in school. 
This, in turn, might offer the unique opportunity to identify im- 
migrant “underachievers,” since immigrant students often attend a 
nonacademic track that is not appropriate to their cognitive poten- 
tial due to a lack of language skills (see, e.g., Burton & Martin, 
2008, or Klapproth, Glock, Krolak-Schwerdt, Martin, & Böhmer,
2013). 

Second, in an attempt to minimize the influence of students’ 
language background on their CPS scores, computer-based assess- 
ment makes it possible for each student to take the test in his or her 
preferred language. In contrast to paper-pencil based measures, a 
student might even switch the language of the test while working 
on it. Thus, when CPS is assessed on the computer using a 
microworld that includes an option to switch to a preferred lan- 
guage, students’ language background should not affect perfor- 
mance, or at least affect it to a lesser degree (relative to paper- 
pencil measures). 

Third, the latest generation of microworlds (see, e.g., the Ge- 
netics Lab; Sonnleitner et al., 2012; Sonnleitner, Keller, Martin, 
Latour, & Brunner, in press) avoids extensive written instructions 
and instead draws on interactive instructions to explain task de- 
mands. Since these demands are also illustrated by animations and 
exercises, instructions only contain a minimum of text. A help- 
button that is present throughout the students’ interaction with the 
microworld ensures guidance in each phase of the assessment (in 
students’ preferred language). Thus, the impact of language back- 
ground and proficiency is potentially minimized. 


Aims of the Present Study 


Educational systems currently face two important challenges: 
(a) The requirement to assess cross-curricular competencies such 
as CPS, which often implies the use of computer-based assessment 
instruments, and (b) the need to gather empirical knowledge on 
performance gaps between students with and without immigration 
background in key cognitive competencies in order to support 
data-based educational policies. The present article contributes
significantly to clarifying both issues. First, we tackle the question of
whether the Genetics Lab, an established computer-based micro- 
world to assess CPS, is fair with regard to immigration back- 
ground. This is an essential prerequisite to study any group-related 
performance differences. Second, drawing on both multidimen- 
sional and hierarchical conceptualizations of CPS, we explore 
performance differences in facets of CPS and general CPS be-
tween students of differing immigration backgrounds. In doing so,
the present article is the first to address these key questions on the 
relations between students’ skill to successfully interact with 
domain-general problems and students’ immigration background. 


Method 


Participants 


The sample consisted of 299 Luxembourg ninth graders who 
were enrolled in two different secondary school tracks (i.e., non- 
academic/intermediate vs. academic school track). Detailed infor- 
mation about the sample is provided in Table 1. One hundred 
eighty-seven students were enrolled in the nonacademic track (96 
of them reporting immigration background), and 112 students were 
enrolled in the academic track (31 with immigration background). 
In total, 127 students reported having an immigration background 
(63 female; M age = 15.7 years, SD = 0.81), and 172 students 
were considered to be native Luxembourg students (83 female; M 
age = 15.4 years, SD = 0.68). Forty-eight of the immigrant 
students were born abroad with no parent born in Luxembourg 
(first generation immigrants, 1G; n = 48), and 79 were born in 
Luxembourg but reported that both parents were born abroad 
(second generation immigrants, 2G; n = 79). 

Although previous studies described performance differences 
between 1G and 2G students (Martin et al., 2012; Stanat, Rauch, & 
Segeritz, 2010), we pooled these groups for the following reasons. 
First, similar to 2G students, the vast majority of our sample’s 1G 
students spent their whole academic career in Luxembourg 
schools. Sixty percent (n = 27) of 1G students were already 
enrolled at age 4, when compulsory education in Luxembourg 
starts, and 77% (n = 35) were enrolled at age 6, when primary 
school starts. Thus, with regard to the main characteristic sus- 
pected to lead to performance differences, 1G and 2G immigrant 
students of our sample can be seen as fairly homogenous. Second, 
a two-sample t test did not reveal significant differences (α <
.05) between 1G and 2G students on the administered cognitive
measures (see Table 1 for means and standard deviations), sup-
porting the notion of homogeneity. Third, the proportion of stu-
dents who speak one of Luxembourg’s official languages at home
(i.e., Luxembourgish, French, German) is similar in 1G (n = 23,
48%) and 2G students (n = 39, 49%). As language is seen as a 
crucial factor strongly determining success in the Luxembourg 
school system (Burton & Martin, 2008; Klapproth et al., 2013), 
neither of these groups seems to have a related advantage. Note 
that a comparable number of students in both immigrant groups 
reported a cultural background similar to that of Luxembourg; that is, they
(1G) or both of their parents (2G) were born in neighboring 
countries (i.e., Belgium, France, Germany). Fourth, by grouping 
1G and 2G students, we also ensured comparability with previous 
large-scale assessment studies that have applied the same group-
ing (e.g., Schleicher, 2006). Fifth, we gained greater statistical 
power to detect and support potential effects between immigrant 
students and their native peers. 


Procedure 


The study was conducted with approval from the national Min-
istry of Education and followed the ethical standards of the host
university. A presentation of the study’s results and feedback on
the students’ performance were offered to the volunteering schools.
Both students and parents received detailed written information
about the scientific background and purpose of the study. None of
the students or their parents took the given opportunity to refuse
participation.

The Genetics Lab, the reasoning scales, and a background 
questionnaire were administered by trained research assistants 
within 110 min (two school lessons) at school during regular class 
time. To ensure commitment, students were offered a prize for the 
two best students of each participating class and given detailed 
written feedback on their performance after completion of the 
study. 


Measures 


Complex problem solving. Students’ complex problem solv- 
ing abilities were assessed using the Genetics Lab (see Figure 1), 
a freely available, computer-based microworld (see http://www.assessment.lu/GeneticsLab
and Sonnleitner et al., 2012). The Ge-
netics Lab was found to be a reliable and valid measure of CPS 
that discriminates between and provides reliable scores for the 
CPS facets rule identification (RI), rule knowledge (RK), and rule 
application (RA; Sonnleitner et al., 2013). 

At the beginning of the Genetics Lab, students could choose 
between a German, a French, and an English version. Accord- 
ingly, instructions and animations were presented in the chosen 
language. Performance across scenarios was summarized by 
three scores reflecting students’ proficiency in the three main 
facets of complex problem solving; the scoring algorithms can 
be found in Keller and Sonnleitner (2012): (a) Each student’s 
exploration strategy was scored on the basis of a detailed 
log-file in which every interaction with the microworld was 
stored. Thus, it was possible to derive a process-oriented mea- 
sure (rule identification) indicating how efficiently a student 
explored a scenario by relating the number of informative 
exploration steps to the total number of steps applied (Kröner et
al., 2005). Note that an exploration step is most informative if 
students manipulate the genes in a way that allows any changes 
in characteristics to be unambiguously attributed to a certain 
gene (see Vollmeyer et al., 1996). (b) Students’ rule knowledge
was assessed by scoring their database records (see Figure 1b) 
using an adaptation of an established scoring algorithm (see 
Funke, 1992). The resulting rule knowledge score thus reflects 
knowledge about how a gene affects a certain characteristic of 
a creature and how strong this effect is. (c) Finally, the actions 
that students took to achieve certain target values on the crea- 
ture’s characteristics during the control phase (see Figure 1c)
were used to compute a process-oriented rule application score. 
Only optimal steps (in the sense that the difference from the 
target values was maximally decreased) were considered to 
indicate good control performance. Given that all target values 
must be achieved within three steps, a maximum score of three 
was possible for each scenario. This approach overcomes the 
limitations of many previous scoring procedures by guarantee- 
ing that the scoring of a certain control step is completely 
independent of the preceding control steps. To facilitate the 
interpretation of the results, all subtest scores were expressed as 
percentage of the maximum possible score (POMP; see Cohen,
Cohen, Aiken, & West, 1999), for which a value of 0 indicates
the lowest possible score, and a value of 100 indicates the
highest possible score (see Table 1 for descriptives). All scales
showed satisfactory internal consistency, with Cronbach’s al-
pha ranging from .75 for RA to .89 for RI and RK.
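
As a minimal sketch of the two simplest computations described above, the rule identification score can be read as the share of informative exploration steps, and any raw score can be rescaled to POMP units. The function names and the default minimum of 0 are assumptions made for illustration; they are not taken from the scoring algorithms in Keller and Sonnleitner (2012).

```python
def rule_identification_efficiency(informative_steps, total_steps):
    """Share of exploration steps that were informative (0.0 to 1.0)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= informative_steps <= total_steps:
        raise ValueError("informative_steps must lie between 0 and total_steps")
    return informative_steps / total_steps


def pomp(raw_score, minimum=0.0, maximum=1.0):
    """Percentage of the maximum possible score (0 = lowest, 100 = highest)."""
    if maximum <= minimum:
        raise ValueError("maximum must exceed minimum")
    return 100.0 * (raw_score - minimum) / (maximum - minimum)


# Hypothetical student: 6 of 8 exploration steps were informative.
efficiency = rule_identification_efficiency(6, 8)

# Hypothetical rule application total of 27 points out of an assumed maximum of 36.
ra_pomp = pomp(27, minimum=0, maximum=36)
```

With these hypothetical inputs, the efficiency is 0.75 and the rescaled rule application score is 75 POMP units.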

When analyzing measurement invariance of facets of CPS, we 
created parcel scores (i.e., sum scores of subsets of items) in order 
to better capture the latent constructs. Compared to individual item 
scores, parcel scores are less prone to distributional violations and 
show higher reliability (Little, Rhemtulla, & Gibson, in press). 
Items that shared a theoretically and empirically supported second-
ary influence beyond the latent construct they were supposed to
measure were combined into a parcel (see Hall, Snell, & Singer
Foust, 1999). For each facet, three item parcels were created. 
Parcel 1 contained items with only two input variables (Items 1, 2, 
and 3), Parcel 2 contained items with three input variables (Items 
4, 5, 6, and 7), and Parcel 3 contained items with three input 
variables and variables that changed dynamically (Items 8, 9, 10, 
11, and 12). 
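
The parcel assignment described above can be sketched as follows. The item-to-parcel mapping follows the text; the dictionary layout, the parcel labels, and the function name are assumptions made for illustration.

```python
# Item numbers per parcel, as listed in the text.
PARCELS = {
    "parcel_1": [1, 2, 3],            # two input variables
    "parcel_2": [4, 5, 6, 7],         # three input variables
    "parcel_3": [8, 9, 10, 11, 12],   # three input variables plus dynamic change
}


def parcel_scores(item_scores):
    """Sum a student's item scores (dict: item number -> score) within each parcel."""
    return {
        name: sum(item_scores[item] for item in items)
        for name, items in PARCELS.items()
    }


# Hypothetical item scores for one student on one facet.
example_scores = {item: item for item in range(1, 13)}
example_parcels = parcel_scores(example_scores)
```

The same function would be applied separately to the item scores of each facet (RI, RK, RA), yielding three parcels per facet.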

When we conceived of CPS as a hierarchical construct and 
analyzed measurement invariance of general CPS, we followed the 
aggregation strategy recommended by Bagozzi and Edwards 
(1998), using the sum scores of all three subscales as indicators of 
this general CPS skill. 

Reasoning. We administered classical paper-pencil measures 
of reasoning ability to serve as a benchmark to explore specific
advantages that a computer-based assessment of CPS might have 
for immigrant students. Specifically, two subtests of the Intelli- 
gence Structure Test IST-2000R (Amthauer, Brocke, Liepmann, & 
Beauducel, 2001), a reliable and valid measure of intelligence, 
were administered to measure students’ ability (a) to complete 
figural matrix patterns (time limit: 10 min; score MA, Figures 2 
and 3), and (b) to complete number series (10 min; score NC, 
Figures 2 and 3). Both scales were administered in paper-pencil 
format, and students could choose between a German, French, and 
an English translation of the instruction. Due to an error in the 
production process of the test booklets, five out of 20 figural 
matrix items could not be analyzed. Note that this loss of data was
completely at random and affected items across the whole range of
difficulty, as judged by the item difficulties that were obtained for
the normative sample (see IST-2000R manual; Amthauer et al.,
2001). Analogous to the Genetics Lab, we expressed the score 
for general reasoning as POMP score with a value of 0 indi- 
cating the lowest, and a value of 100 indicating the highest 
possible score. Whereas internal consistency for number com-
pletion was found to be satisfactory (α = .90), it was rather
poor for the matrices (α = .48; compared to α = .71 for the full
scale as reported in the manual; Amthauer et al., 2001); this
might be explained by the lower number of items. Note, how-
ever, that for the analyses of the present study (see below) we did
not focus on the internal consistency of manifest scale scores but
rather on the factor reliability of the (latent) common reasoning
factor underlying both scales, which explained an acceptable
44.3% of their variance (reliability of the reasoning factor score
ω = .44; see also Brunner, Nagy, & Wilhelm, 2012).
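
For readers who want to reproduce an internal consistency estimate of the kind reported above, a minimal sketch of Cronbach's alpha is given below. It uses the standard formula based on item variances and the variance of the sum score; the data layout (one list of item scores per student) and the function name are assumptions, and the example data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of students, each a list of item scores."""
    n_items = len(rows[0])
    if n_items < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two students")
    if any(len(row) != n_items for row in rows):
        raise ValueError("every student needs a score on every item")
    # Sample variance of each item and of the sum score across students.
    item_variances = [variance([row[i] for row in rows]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("sum scores do not vary across students")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: three students, two items.
alpha_example = cronbach_alpha([[1, 2], [2, 1], [3, 3]])
```

In this toy example both item variances equal 1 and the sum score variance equals 3, so alpha is 2 × (1 − 2/3) ≈ .67.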

Background questionnaire. A background questionnaire was
administered including questions about the students’ and their
parents’ demographic characteristics such as immigration back-
ground, language spoken at home, and age.


Figure 2. Strict measurement invariant models to study measurement invariance and performance differences
in complex problem solving (CPS) and reasoning in relation to students’ immigration background. The model
in A focuses on general CPS, whereas the model in B considers three different facets of CPS. Rea = reasoning;
MA = matrices; NC = number completion; RI = rule identification; RK = rule knowledge; RA = rule
application; RI1-RI3 = parcel scores of rule identification items; RK1-RK3 = parcel scores of rule knowledge
items; RA1-RA3 = parcel scores of rule application items. Standardized model solution is shown.


Statistical Analyses 


To study whether students with immigration background dif- 
fered in their problem solving abilities from their fellow students, 
we first ensured that the measurement properties of the Genetics 
Lab and the reasoning tests were invariant across students with 
differing immigration background. This was done using a stepwise 
approach based on multiple-group factor analytic models (Little, 
1997; Lubke, Dolan, Kelderman, & Mellenbergh, 2003; Widaman 
& Reise, 1997). Specific levels of measurement invariance (i.e., 
model constraints that were imposed in the different steps) are 
explained in the results section. All model parameters were esti- 
mated using Mplus 5.2 (L. K. Muthén & Muthén, 1998-2010) by 
means of the maximum likelihood estimator (ML) and all reported
coefficients are based on standardized solutions. Measurement
invariance for a hierarchical conceptualization of CPS including a 
general CPS skill factor (Figure 2a) was investigated with Models 
H1 to H4. Measurement invariance for a faceted conceptualization 
of CPS including the facets of rule identification, rule knowledge, 
and rule application (Figure 2b) was investigated with Models F1
to F4. The Type I error risk α for data analyses was set at .05,
two-tailed.

Figure 3. Adjustment of mean differences due to migration background and academic track by means of a
multiple-indicator, multiple-causes (MIMIC) model for reasoning and (A) general complex problem solving
(CPS) and (B) facets of CPS. For reasons of clarity, not all correlations are shown. H = hierarchical; F =
faceted; Rea = Reasoning; MA = matrices; NC = number completion; RI = rule identification; RK = rule
knowledge; RA = rule application; RI1-RI3 = parcel scores of rule identification items; RK1-RK3 = parcel
scores of rule knowledge items; RA1-RA3 = parcel scores of rule application items. Significant effects are
depicted in bold. Standardized model solution is shown.


On the basis of multiple criteria (Little, 1997; Widaman &
Reise, 1997), we evaluated whether a certain level of measurement
invariance could be assumed for the reasoning scales and the
Genetics Lab. First, we consulted the χ² goodness-of-fit statistic
and several indices describing overall model fit: the comparative
fit index (CFI); the standardized root-mean-square residual
(SRMR); and gamma, which is less sensitive to model size than the 
related and more popular root-mean-square error of approximation 
(Fan & Sivo, 2007). For a detailed description of these fit statistics, 
including formulas, please refer to Iacobucci (2010). Following 
recommendations by Hu and Bentler (1999), CFI and gamma 
values above .95 and SRMR values below .08 were considered as 
indicating good fit between hypothesized model and observed 
data. In a second step, we checked each model for local misspeci- 
fications. Residual correlations above .10 were considered to be 
problematic (cf. McDonald, 2010). Third, we evaluated whether a
more restrictive form of measurement invariance (including addi-
tional cross-group equality constraints) was tenable by inspecting
the change in descriptive fit statistics. For CFI, a change of less 
than .01 was seen as acceptable (Cheung & Rensvold, 2002). For 
SRMR and gamma, we considered differences below .05 to indi- 
cate that cross-group equality constraints had little influence on 
model fit (see Little, 1997). Since the more restricted measurement 
models were nested within the less restricted ones, it was further 
possible to compute χ² difference tests showing whether the mod-
els significantly differed in model fit. Fourth, in evaluating the 
degree of measurement invariance, we also took into account the 
theoretical implications and parsimony of the models. After con-
sulting the modification indices of Mplus, we opted for the most
parsimonious and substantive model when local misfit (i.e., the
difference between model implied means or covariances on the 
one side and actual data on the other) was negligible (Little, 1997; 
McDonald, 2010; Widaman & Reise, 1997). 
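
The descriptive cutoffs named above can be collected in a short sketch. The thresholds are those stated in the text; the function names are assumptions made for illustration. The check covers only the descriptive indices: inspecting local misfit, modification indices, and theoretical plausibility remains a matter of judgment, as does the handling of rounded values that fall exactly on a cutoff.

```python
def indicates_good_fit(cfi, srmr, gamma):
    """Overall fit: CFI and gamma above .95, SRMR below .08."""
    return cfi > 0.95 and gamma > 0.95 and srmr < 0.08


def constraints_tenable(delta_cfi, delta_srmr, delta_gamma):
    """Added equality constraints: |ΔCFI| below .01, |ΔSRMR| and |Δgamma| below .05."""
    return (
        abs(delta_cfi) < 0.01
        and abs(delta_srmr) < 0.05
        and abs(delta_gamma) < 0.05
    )


# Hypothetical fit values for a single model and for one model comparison.
fit_ok = indicates_good_fit(cfi=0.97, srmr=0.05, gamma=0.97)
step_ok = constraints_tenable(delta_cfi=0.00, delta_srmr=0.01, delta_gamma=0.00)
```

Both checks return True for the hypothetical values shown; a model with, say, CFI = .90 and SRMR = .10 would fail the first check.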

To determine (a) whether immigration background explained 
performance discrepancies in students’ problem solving abilities 
and (b) whether such a performance discrepancy was found in both 
reasoning and CPS, we compared the latent means of students with 
differing immigration background. For this purpose, we fixed 
latent means of students without immigration background to zero 
for identification purposes. Thus, the resulting latent means for 
students with immigration background represent differences in the 
construct due to immigration background. Using the obtained 
latent mean of these students and the pooled standard deviations, 
which were calculated from the latent standard deviations of both 
groups, we computed Cohen’s d with positive d values indicating 
that students with immigration background outperformed their 
peers without immigration background. 
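
The effect size computation can be sketched as follows. The text does not spell out how the latent standard deviations were pooled, so the sample-size-weighted formula below is an assumption, as are the function names and the example values.

```python
import math


def pooled_sd(sd_a, n_a, sd_b, n_b):
    """Sample-size-weighted pooled standard deviation of two groups (assumed formula)."""
    pooled_variance = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return math.sqrt(pooled_variance)


def cohens_d(latent_mean_focal, sd_pooled, latent_mean_reference=0.0):
    """Standardized mean difference; the reference group mean is fixed to zero."""
    return (latent_mean_focal - latent_mean_reference) / sd_pooled


# Hypothetical latent mean of -0.3 for the focal group, with unit latent SDs
# and the group sizes reported in the Participants section (127 and 172).
d_example = cohens_d(-0.3, pooled_sd(1.0, 127, 1.0, 172))
```

Because the latent mean of students without immigration background is fixed to zero, a negative d in this sketch means that the focal group (students with immigration background) scored lower.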

To this end and to further study the importance of the attended 
academic track (see above), we ran two multiple-indicator, 
multiple-causes (MIMIC) models (Jöreskog & Goldberger, 1975;
B. O. Muthén, 1989). We adjusted mean differences due to immi- 
gration background and school track in a hierarchical (Figure 3, 
Model H-MIMIC) and a faceted CPS conceptualization (Figure 3, 
Model F-MIMIC), since the appropriate psychometric conceptu- 
alization of CPS is still debated. In both models the unique 
relations of academic track and immigration background were 
expressed as standardized effect sizes. Their joint relation to cog- 
nitive outcomes was expressed in terms of the amount of explained 
variance R².


Results 


Descriptives 


Descriptive statistics of the obtained performance scores accord- 
ing to immigration background and academic track are presented 
in Table 1.* As all scores were expressed as percentage of the 
maximum possible score (POMP; Cohen et al., 1999), a direct 
comparison between scores and groups is possible when direct 
inferences on psychometric properties of the scales are avoided. In 
general, the highest scores were obtained on rule knowledge, 
whereas the lowest scores were obtained on rule identification. No 
performance differences could be found between immigrant stu- 
dents and their nonimmigrant peers in the academic track, but 
nonimmigrant students slightly outperformed students with immi- 
gration background in the nonacademic track. A small difference 
in favor of immigrant students could be obtained only on the rule 
identification score, indicating that these students applied 3.2% 
more informative steps when exploring a problem. However, from 
a practical point of view, these differences are negligible. When 
comparing performance across school tracks, however, students 
enrolled in the academic track consistently outperformed students 
in the nonacademic track. Thus, immigrant students enrolled in the 
nonacademic track showed the lowest performance on all scales 
except rule identification (see Table 1). 


Measurement Invariance 


We first investigated measurement invariance for reasoning and 
CPS conceived as a hierarchical construct with a general CPS 
factor (Figure 2a). Model fit indices are given in Table 2. Although 
the χ² test statistic was found to be significant for the baseline
model that assumes the same factorial pattern across groups (i.e., 
configural measurement invariance; Model H1), descriptive fit 
indices suggested acceptable model fit. Further, no substantive and 
theoretically justified changes were indicated by local misfit or the 
modification indices provided by Mplus. Indeed, the observed 
correlational pattern between reasoning and general CPS mirrored 
patterns found in previous studies. Thus, configural measurement 
invariance was tenable for both constructs, and Model H1 served 
as a benchmark for the subsequent analyses. Model H2, assuming
weak invariance, fitted the data equally well. Descriptive fit indices
indicated good fit, and the χ² difference to Model H1 (Δχ²) was
nonsignificant (p = .80). Even more restrictive constraints on
model parameters in Model H3 representing strong measurement
invariance had little influence on model fit. Compared to Model
H2, only slight changes were found, with ΔCFI = −.01 and Δχ²
remaining nonsignificant. Although strong measurement invari- 
ance already ensures a fair comparison of latent factor means and 
variances across groups (Widaman & Reise, 1997), we further 
investigated whether strict measurement invariance was tenable 
(Model H4). Model fit indices suggested good fit to the data, the
change in χ² compared to Model H3 was nonsignificant (p = .96),
and all parameters were clearly interpretable and of substantive
meaning (Figure 2a). Moreover, Model H4 is the most parsimoni- 
ous model and modification indices of Mplus were negligible and 
indicated no local misfit. Thus, we concluded that the most re- 
strictive form of measurement invariance was tenable for reason- 
ing and general CPS, hence allowing for the comparison of latent 
means on these constructs (but also manifest test scores) for 
students with differing immigration background. 
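
The χ² difference tests used in this stepwise comparison can be reproduced approximately from the rounded values in Table 2. The sketch below is an illustration under stated assumptions: it relies on the ordinary difference test for nested models estimated with maximum likelihood, the function names are hypothetical, and the upper-tail probability is computed with a closed-form recurrence for integer degrees of freedom rather than with a statistics library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if df < 1 or x < 0:
        raise ValueError("df must be >= 1 and x must be >= 0")
    y = x / 2.0
    # Start from the closed forms Q(1, y) = exp(-y) and Q(1/2, y) = erfc(sqrt(y)),
    # then apply Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_difference(chi2_restricted, df_restricted, chi2_free, df_free):
    """Return (delta chi-square, delta df, p) for two nested models."""
    delta_chi2 = chi2_restricted - chi2_free
    delta_df = df_restricted - df_free
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


# Rounded Table 2 values: Model H2 (chi2 = 20, df = 11) vs. Model H1 (chi2 = 19, df = 8).
h1_vs_h2 = chi2_difference(20, 11, 19, 8)
```

For this comparison the sketch gives Δχ² = 1 with Δdf = 3 and p ≈ .80, in line with the value reported for H1 vs. H2; small deviations are to be expected for other rows because the tabled χ² values are rounded.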

We also examined measurement invariance when CPS was 
conceptualized as a faceted construct (Figure 2b). Descriptive fit 
indices indicated acceptable model fit for our baseline Model F1
representing configural invariance (see Table 2). The same held
true for Model F2, suggesting that a weak invariant model of a
faceted CPS construct was tenable for students with differing
immigration background. Note that although χ² was found to be
significant for both Models F1 and F2, we emphasized descriptive
fit indices since local misfit was negligible. Latent correlations
between reasoning and facets of CPS were found to be of the same
size as those reported in the literature (Greiff et al., 2013; Kröner
et al., 2005; Sonnleitner et al., 2013; Wüstenberg et al., 2012).
Moreover, constraining the factor loadings to be equal for students
with and without immigration background in Model F2 did not
significantly impact model fit (Δχ² = 10 with p = .19 and
ΔSRMR = .01). When intercepts were restricted across groups in
Model F3, Δχ² indicated a significant change in model fit. Inspect-
ing descriptive fit statistics, however, revealed only a slight
change, with ΔCFI = −.01 and Δgamma = −.01. Again we could
not identify specific local misfit, and modification indices did not
suggest freeing any other parameters. Thus, strong measurement
invariance was tenable. Introducing additional across-group con-
straints on the unique factor variances or measurement residuals
in Model F4 did not significantly impact model fit (Δχ² = 12 with
p = .36). Moreover, changes in descriptive fit statistics did not
exceed .01. As Model F4 was the most parsimonious of the posited
models, and all parameters were substantive (Figure 2b), we con-
cluded that even strict measurement invariance was tenable for
CPS as a faceted construct.

Taken together, the analyses of measurement invariance showed
that meaningful comparisons for students with differing immigra-
tion background could be made for reasoning and CPS scores,
regardless of whether the latter were represented as facets of CPS
or as a general CPS factor.


* Note that descriptive statistics for all manifest measures including
covariance matrices for immigrant as well as nonimmigrant students are
provided in the online supplemental materials.


Table 2
Measurement Invariance of Reasoning and Complex Problem Solving Conceived Either as Hierarchical or as Faceted Construct

Model                      χ²           df   p      CFI          SRMR  gamma        Δχ²  Δdf  p     ΔCFI   ΔSRMR  Δgamma

Measurement invariance of reasoning and hierarchical complex problem solving
H1. Configural invariance  19           8    .02    .98          .04   [illegible]
H2. Weak invariance        20           11   .05    .98          .05   .97
H1 vs. H2                                                                           1    3    .80   .00    .01    .00
H3. Strong invariance      28           14   .01    .97          .05   .97
H2 vs. H3                                                                           8    3    .05   −.01   .00    .00
H4. Strict invariance      29           19   .07    .98          .05   [illegible]
H3 vs. H4                                                                           1    5    .96   .01    .00    .00

Measurement invariance of reasoning and faceted complex problem solving
F1. Configural invariance  123          70   <.01   .96          .06   .94
F2. Weak invariance        133          77   <.01   .95          .07   .94
F1 vs. F2                                                                           10   7    .19   −.01   .01    .00
F3. Strong invariance      150          84   <.01   .94          .07   .93
F2 vs. F3                                                                           17   7    .02   −.01   .00    −.01
F4. Strict invariance      [illegible]  95   <.01   .94          .08   .92
F3 vs. F4                                                                           12   11   .36   .00    .01    −.01

Adjustment of mean differences due to migration background and academic track
H-MIMIC                    19           10   .04    .98          .03   [illegible]
F-MIMIC                    113          49   <.01   [illegible]  .05   .96

Note. H = hierarchical; F = faceted; χ² = chi-square goodness-of-fit statistic; CFI = comparative fit index; SRMR = standardized root-mean-square
residual; MIMIC = multiple-indicator, multiple-causes.


Group Differences Due to Immigration Background


Results given in the lower part of Figure 2a clearly show that
students without immigration background significantly outper-
formed their peers with an immigration background in reasoning
(d = −0.30) as well as general CPS (d = −0.21). However, when
group differences were inspected within a faceted conceptualiza-
tion of CPS (lower part of Figure 2b), the difference did not hold
for all facets of CPS equally. Specifically, results obtained for rule
identification suggested that students with immigration back-
ground outperformed their peers when identifying rules and gen-
erating knowledge (RI, d = 0.21). Despite the effect’s small size,
this finding is even more remarkable, since the other facets (as
could be expected from findings concerning general CPS) are
negatively impacted by immigration background, with d = [value
illegible in source] for rule knowledge and d = −0.22 for rule
application. In other words, these results suggest that with each
subsequent phase of the problem solving process following the
generation of knowledge, performance differences become more
pronounced, resulting in the largest performance advantage for
native students for rule application. Another interesting finding
concerns the differential relation of immigration background with
reasoning and CPS. Irrespective of whether CPS was conceived of
as a hierarchical or as a faceted construct, measures of CPS seem
to be somewhat less affected by immigration background than are
measures of reasoning.


Group Differences Due to Immigration Background 
and Academic Track 


To adjust performance differences with respect to immigration 
background for differential attendance of the academic track, we 
ran two MIMIC models that conceptualized CPS either as a 
hierarchical or faceted construct. Both MIMIC models showed
acceptable model fit (see Table 2). Despite significant χ² statistics,
descriptive model fit indices suggested good fit to the data. As we 
had no indication of local misfits, interpretation of the obtained 
parameters seemed justified. In general (when controlling for 
immigration background), we observed that students attending the 
academic track outperformed students attending the nonacademic 
track on general CPS, facets of CPS, and reasoning ability (see 
Figure 3). Crucially, performance differences between students 
with and without immigration background became negligibly 
small and nonsignificant when we took into account the fact that
students with immigration background were more likely to be 
enrolled in the nonacademic track (Figure 3). Only when CPS was 
conceptualized as a faceted construct did a substantial (and statis-
tically significant) influence of immigration background remain for
rule identification (d = 0.26), indicating that students with immi- 
gration background outperformed their native peers on this facet of 
CPS in both academic tracks. As a side note, academic track
explained a considerably larger portion of variance in reasoning
ability than in general CPS or facets of CPS, respectively. 


Discussion 


Educators across the globe increasingly emphasize the impor- 
tance of cross-curricular skills (Elliot Bennett et al., 2003; Kuhn, 
2009; Ridgway & McCusker, 2003). The importance of these 
skills has been illustrated through their introduction in several 
large-scale studies such as PISA (Leutner et al., 2012; OECD, 
2010). This suggests that the training of problem solving skills 
might eventually become an official part of school curricula and 
that computer-based assessment of complex problem solving will 
play a central role in the future of educational systems. However, 
as outlined in the introduction, there are several reasons why one 
might reasonably expect students’ cultural background to influ- 
ence performance in CPS tasks. The present study is the first to 
empirically examine (a) whether computer-based assessment of 
CPS is fair with regard to immigration background, and (b) 
whether CPS performance differences exist between students with 
and without immigration backgrounds. To answer these questions, 
the study drew on a Luxembourg sample of ninth grade students 
with and without immigration background who were enrolled in 
different academic tracks. 


Fairness of CPS With Regard to Immigration 
Background 


Several factors can be identified that might affect immigrant 
students’ performance in measures of complex problem solving. 
Besides culture-specific exploration and knowledge generation 
strategies (Güss et al., 2010; Strohschneider & Güss, 1999), im-
migrant students might lack cultural knowledge about the context 
in which a problem is presented (Martin et al., 2012), and their 
often poorer language proficiency could impair performance even 
in largely “language-independent” subjects and assessments 
(Kempert et al., 2011; Levin & Shohamy, 2008). 

However, results of this study clearly showed that the adminis- 
tered microworld to assess CPS (i.e., the Genetics Lab; Sonnleitner 
et al., 2012) was measurement invariant and thus fair with regard 
to students’ immigration background. As the structure of CPS is 
still a matter of debate (Sonnleitner et al., 2013), we investigated 
whether a hierarchical conceptualization including a general CPS 
factor or a faceted conceptualization including the facets of rule 
identification, rule knowledge, and rule application could be mea-
sured fairly in both groups. Crucially, the highest level of mea-
surement invariance (i.e., strict measurement invariance) could be
established for both conceptualizations of CPS (Models H4 and
F4). Thus, regardless of whether research is interested in a general
CPS skill (e.g., PISA 2012; OECD, 2010) or the facets of CPS
(e.g., in applied educational contexts), the obtained performance 
scores of the Genetics Lab seem to be measurement invariant and 
thus fair measures of CPS irrespective of students’ immigration 
background. In the light of findings on cross-cultural differences in 
exploration and knowledge acquisition strategies (Güss et al.,
2010; Strohschneider & Güss, 1999), it is even more remarkable
that measurement invariance was also tenable for rule identifica- 
tion. Given the increase in the numbers of immigrant students in
many countries worldwide (Schleicher, 2006), this is an important
prerequisite for comparative small- and large-scale studies. This
finding also substantially contributes to promising previous results 
that established measurement invariance for the facets rule knowl- 
edge and rule application with regard to students’ sex, nationality 
(Germany vs. Hungary), and educational background (Greiff et al., 
2013; Wüstenberg et al., 2014). The present study is the first,
however, showing that measurement invariance is also tenable for 
the facet rule identification, as this facet was not included in 
previous studies. Nevertheless, given specific features of the Ge-
netics Lab such as game-like characteristics and multilingual-
friendly features (e.g., multilingual, multimedia instructions and a
help function), our findings cannot automatically be generalized to 
other microworlds assessing CPS (see also Greiff et al., 2014). 
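
The levels of measurement invariance referred to above can be written out in common-factor notation. The following is a generic sketch, not the authors' model specification: the symbols \tau, \Lambda, \eta, \varepsilon, and \Theta, the group index g, and the labels configural, weak, strong, and strict are standard textbook conventions assumed here for illustration.

```latex
% Assumed notation: x = observed indicators, \eta = latent CPS factor(s),
% g = group (students with vs. without immigration background).
\begin{align*}
  x^{(g)} &= \tau^{(g)} + \Lambda^{(g)} \eta^{(g)} + \varepsilon^{(g)},
  \qquad \operatorname{Cov}\bigl(\varepsilon^{(g)}\bigr) = \Theta^{(g)} \\
  \text{configural:} &\quad \text{same pattern of free and fixed loadings in every group} \\
  \text{weak:}   &\quad \Lambda^{(1)} = \Lambda^{(2)} \\
  \text{strong:} &\quad \Lambda^{(1)} = \Lambda^{(2)},\ \tau^{(1)} = \tau^{(2)} \\
  \text{strict:} &\quad \Lambda^{(1)} = \Lambda^{(2)},\ \tau^{(1)} = \tau^{(2)},\ \Theta^{(1)} = \Theta^{(2)}
\end{align*}
```

Under the strict level, group differences in observed scores can be attributed to differences in the latent means and variances, not to the measurement part of the model.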


Performance Differences With Regard to Immigration 
Background 


In line with previous findings (OECD, 2012; Schleicher, 2006), 
students with immigration background were generally outper- 
formed by their native peers. Results of Model H4 (Figure 2a) 
indicated that native students showed a significantly higher general 
skill to solve complex problems than their immigrant peers. A 
comparison with a classic, paper-pencil-based measure of reason- 
ing, however, revealed that immigration background showed a 
stronger influence on reasoning than on general CPS (Figure 2a). 

Crucially, the investigation of a faceted CPS conceptualization 
(Model F4, Figure 2b) was shown to be of substantial value as a 
possible explanation for the results in Model H4 concerning gen- 
eral CPS. In the faceted Model F4, we could show that immigrant 
students applied a somewhat more efficient exploration strategy 
than their nonimmigrant peers. In the subsequent problem solving 
steps, however, students without immigration background outper- 
formed their immigrant peers. Given these results, it seems that 
students with immigration background might have difficulties 
transferring the generated information about a problem into de- 
clarative knowledge depicted in causal diagrams or applied to 
achieve certain target values. Interestingly, the performance gap 
increased with each phase following the generation of knowledge, 
resulting in the largest performance difference in the third and final 
problem solving phase of rule application. 

Although reasoning tasks as well as CPS tasks draw on cogni- 
tive processes that are responsible for the acquisition and the 
application of knowledge (Wüstenberg et al., 2012), reasoning
tasks only provide the final results of these processes. Thus, 
compared to CPS tasks they might underestimate immigrant stu- 
dents’ overall ability. For instance, if rule application is most 
important in determining the correctness of a multiple-choice 
answer, immigrant students’ weakness in transferring and applying 
the gathered knowledge might result in a stronger effect of immi- 
gration background on reasoning scales and hide the strength of 
immigrant students in rule identification. Importantly, this finding 
clearly shows the value of a faceted conceptualization of CPS and 
demonstrates that future studies should not only investigate knowl- 
edge acquisition or knowledge application or a general CPS skill 
(as in PISA 2012) but also students’ problem exploration strate-
gies.

Performance differences when taking academic track into 
account. On the basis of previous findings showing a strong 
influence of educational background on CPS skill (e.g., Greiff et 
al., 2013) and the fact that the majority of immigrant students in
Luxembourg is enrolled in nonacademic tracks (Burton & Martin, 
2008; Klapproth et al., 2013), we investigated performance differ- 
ences in students of differing immigration backgrounds while at 
the same time controlling for the academic track in which they 
were enrolled (Figure 3, Models H-MIMIC and F-MIMIC). Impor-
tantly, results of both MIMIC models clearly showed that aca-
demic track explained most of the differences found for immigra-
tion background, with students enrolled in the academic track 
clearly outperforming their peers in the nonacademic track. Thus, 
performance differences in CPS and reasoning due to immigration 
background as found in Models H4 and F4 may have simply been 
due to the fact that the majority of immigrant students were 
enrolled in the nonacademic track. This finding also indicates that 
being educated in the academic track may improve performance in 
these tasks, even if it could also be assumed that initial perfor- 
mance differences might have contributed to the placement deci- 
sion for the nonacademic track. Note that a positive influence of 
academic track attendance has also been found for performance on 
reasoning tasks (Becker et al., 2012; Gustafsson, 2008). 

The faceted conceptualization of CPS in Model F-MIMIC, 
however, again provided a substantial finding. Students with 
immigration background applied more efficient exploration 
strategies than their native peers irrespective of the academic 
track they attended. Although descriptive statistics indicated 
that this difference was only small in size and could only be 
found in the nonacademic track, manifest measures were not 
free from measurement error and may thus underestimate real 
(latent) differences. For individual assessment, this finding is of 
special importance, as it may point to otherwise overlooked 
potential. As mentioned above, especially in Luxembourg, 
many students with immigration background are oriented to 
nonacademic tracks due to low language proficiency in any or 
all of the country’s three official languages, Luxembourgish,
French, and German (Burton & Martin, 2008; Klapproth et al., 
2013; Shewbridge, Tamassia, Santiago, & Ehren, 2012). Given 
the positive influence of academic track on reasoning scales 
found in our study and reported by Becker et al. (2012) and 
Gustafsson (2008), such scales may not be suited to assess 
students’ real cognitive potential; they might simply reflect the 
training that is (or is not) provided in a specific academic track. 
Importantly, we still found a significant relation between im- 
migration background and rule identification after controlling 
for academic track, indicating that educational background had 
less influence on this skill. Since systematic exploration behav-
ior is presumably trained less extensively in school than think-
ing in causal diagrams or the application of knowledge, novelty 
of task demands might explain this finding. This is possibly 
reflected in the generally lower performance scores in rule
identification compared to the other assessed abilities (see 
Table 1). Consequently, rule identification might be used to 
identify immigrant “underachievers” who are enrolled in non- 
academic tracks due to their lower language proficiency but 
have the cognitive potential to succeed in an academic track. In 
sum, results suggest that CPS may be a fairer and more valid 
assessment of students’ cognitive abilities than traditional rea- 
soning scales, and this may in particular hold for the dimension 
of rule identification. 


Limitations and Outlook 


A crucial aspect in investigating immigrant students is the 
definition of immigration background. For the present article, we 
applied an internationally well-established classification, defining 
students as having immigrant status if they were either born abroad 
and later moved to the host country or they were born in the host 
country but both parents were born abroad (cf. OECD, 2012). 
Despite broad acceptance of this term, there is still ambiguity 
concerning students who have only one parent born abroad or 
parents who were born in two different countries. In these cases, a 
classification concerning immigrant status or country of origin is 
difficult (Stanat et al., 2010). This, however, highlights the heter- 
ogeneity of most immigrant student samples. Nevertheless, from 
an educational or even political perspective, focus should be set on 
analyzing groups that are homogeneous with regard to perfor- 
mance, as they allow broader conclusions for everyday educational 
practice. The grouping of first- and second-generation immigrant 
students in our study led to such a (relatively) homogeneous 
sample with regard to cognitive performance. 

The current investigation of measurement invariance and stu- 
dents’ performance differences with regard to immigration back- 
ground would undoubtedly have profited from a larger sample. 
Although the percentage of immigrant students in both school 
tracks can be seen as representative for Luxembourg (Burton & 
Martin, 2008), the subsamples were too small to conduct more
fine-grained analyses, for example concerning the effect of country 
of origin. Note, however, that sample size was large enough to 
detect statistically significant effects for the group as a whole. 
Nevertheless, this study should only be seen as a starting point.
Further research is needed to generalize our findings to different 
school grades and countries or to study specific subsamples of 
immigrant students with special risk factors. Such studies would 
undoubtedly be beneficial for educational policies aimed at reduc- 
ing the performance gap between immigrant students and their 
native peers. The upcoming PISA 2012 data set (OECD, 2010), 
including measures on CPS and detailed information on immigra- 
tion background, might be a rich source for such analyses. 

Another limitation concerns the interpretation of the observed 
effects of immigrant background and school track. Since random 
allocation is impossible in such contexts, it is not possible to 
establish causality. Positive effects of academic track can either be 
interpreted as positive influence of the advanced training on prob- 
lem solving skills or simply as evidence that students with better 
problem solving skills are more likely to attend an academic track. 
Longitudinal studies would provide interesting information on 
such schooling effects but would nevertheless be limited by non- 
random allocation of students. 

Although a slight advantage in rule identification could be 
identified for immigrant students, it is still unclear why these 
differences occur. More fine-grained analyses of the exploration 
strategies shown in students’ log-file data (Hadwin, Nesbit, 
Jamieson-Noel, Code, & Winne, 2007) or even think-aloud pro- 
tocols might shed light on this issue, as such studies would allow 
understanding the exact strategies that students use in order to 
identify the underlying rules. Previous qualitative studies have 
shown, for example, that German university students employ more 
control-oriented strategies in CPS tasks than students from India or 
Brazil (Güss et al., 2010; Strohschneider & Güss, 1999). A similar
effect might explain our observed differences in rule identification,
though the results of existing studies cannot automatically be 
generalized to our study, since the problem solving strategies 
applied by students depend strongly on the specific microworld 
used to assess CPS. 


Conclusion 


In sum, the present study provides evidence that CPS as as- 
sessed by the Genetics Lab, a computer-based microworld that 
incorporates multilingual-friendly features (e.g., multilingual, mul-
timedia instructions, and a help function), can be equitably mea-
sured with respect to students’ immigration background. Such
fairness is a prerequisite for small- and large-scale studies (such as
PISA) that aim to compare complex problem-solving abilities of 
students with differing immigration backgrounds. Special value 
was shown in the analysis of students’ exploration strategies, 
highlighting the informative potential of computer-derived process 
measures and pointing to a future direction in CPS research. This 
last point highlights the value of a faceted approach for CPS,
which might reveal, especially in immigrant students, cognitive
facets that are a relative strength for these students and that
might go unnoticed by educators without adequate CPS assessments. Thus,
CPS assessment shows a high potential for being a central element 
in future educational curricula with a strong focus on cross- 
curricular skills. 
motivation


Research on achievement emotions has largely focused on test 
anxiety (Zeidner, 1998, 2007). Emotions other than anxiety were 
neglected, despite their ubiquity in the classroom and relevance to 
students’ academic performance, development, and health 
(Pekrun, Goetz, Titz, & Perry, 2002). Recent progress in educa- 
tional research on emotions has begun to draw attention to the 
diversity of emotions that students experience (Linnenbrink- 
Garcia & Pekrun, 2011; Pekrun & Linnenbrink-Garcia, in press; 
Schutz & Lanehart, 2002; Schutz & Pekrun, 2007). However, 
surprisingly few of these studies have featured students’ boredom. 
Outside academia, boredom has been shown to relate to delin-
quency, gambling, drug use, and health problems (Amos, Wilt- 
shire, Haw, & McNeill, 2006; Blaszczynski, McConaghy, & 
Frankova, 1990; Newberry & Duncan, 2001; Thackray, 1981). 
Thus, it seems plausible to assume that boredom can have equally 
pronounced effects within achievement settings. Boredom is an 
emotion that is among the most frequently experienced, and po- 
tentially most devastating, affective states occurring in the class- 
room (Mann & Robinson, 2009; Pekrun, Goetz, Daniels, Stupni- 
sky, & Perry, 2010). 

Accordingly, the present research examined the relationship 
between students’ boredom and their academic achievement. It is 
posited that these two constructs are related by reciprocal causa- 
tion, in contrast to traditional unidirectional models of emotions 
affecting achievement. Sporadic correlational evidence suggests 
that boredom is negatively related to academic achievement, as 
detailed below. However, the underlying causal relationships are 
virtually unexplored. Hence, the reciprocal linkages between these 
two constructs have not been examined, leaving unresolved the 
question whether boredom is functionally relevant in impacting 
performance, or is just an epiphenomenon of low achievement. 

Examining reciprocal relations between boredom and achieve- 
ment is of considerable theoretical and practical importance. Evi- 
dence on reciprocal relations bears on the validity of systems- 
oriented theories proposing that emotions and performance are 
linked by feedback loops rather than by unidirectional causation 
(Pekrun, 2006; Turner & Waugh, 2007). With regard to educa- 
tional practice, it is important for teachers, administrators, and
parents to know whether boredom is detrimental to students’ 
performance, or simply a by-product of poor achievement that is of 
secondary relevance because it does not impact future educational 
attainment. 

In the following sections, we first discuss the construct of 
boredom as an achievement emotion. We then summarize the 
paucity of evidence on boredom and achievement and present a 
theoretical model on their reciprocal linkages. The model was 
tested in an empirical study that investigated course-related bore- 
dom and course performance in undergraduate students over a full 
academic year. 


Boredom as an Achievement Emotion 


Boredom comprises unpleasant feelings, reduced physiological 
arousal, perceived lack of cognitive stimulation, task-irrelevant 
thinking (e.g., daydreaming), prolonged subjective duration of 
time, and impulses to escape the boredom-inducing situation 
through disengagement (Goetz & Hall, in press; Mikulas & Vo- 
danovich, 1993; Pekrun et al., 2010; van Tilburg & Igou, 2012; 
Vogel-Walcutt, Fiorella, Carper, & Schutz, 2012; for variants of 
the boredom experience, see Goetz et al., in press). From an 
evolutionary perspective, boredom serves to limit engagement in 
activities that lack consummatory value, that do not promise to 
yield any reinforcement, and that are not suited to broaden the 
individual’s thought-action repertoire, thus making it possible to 
redirect attention toward more rewarding activities. 

Boredom is experienced frequently by students in academic 
achievement settings (Larson & Richards, 1991; Mann & Robin- 
son, 2009; Pekrun et al., 2010). In contrast to emotions linked to 
success and failure outcomes (e.g., anxiety, pride, or shame), 
boredom relates to the achievement activities performed in these 
settings, such as attending classes or doing homework. Achieve- 
ment emotions are defined as emotions related to achievement 
activities or their outcomes (Pekrun, 2006) and can be classified 
using the 2 × 2 (Object Focus × Valence) taxonomy of achieve-
ment emotions proposed by Pekrun et al. (2002). Within this
taxonomy, boredom represents an activity-related achievement 
emotion since the reference object is the current achievement 
activity, rather than the outcome of the activity. 

As boredom involves low arousal and distinct affective, cogni- 
tive, and motivational components as described earlier, it differs 
from other negative emotions such as anger, test anxiety, or shame 
(van Tilburg & Igou, 2012). Correlations between boredom and 
other negative emotions such as test anxiety have been in the range 
of r = .30 to .50, thus corroborating the distinctiveness of this 
emotion (Daniels et al., 2009; Pekrun, Goetz, Frenzel, Barchfeld, 
& Perry, 2011; van Tilburg & Igou, 2012). Moreover, boredom is 
also different from a mere lack of positive affect, situational 
interest, or flow experiences (Csikszentmihalyi, 1975). Boredom is
an unpleasant affective state that consists of specific component 
processes that can be highly aversive, thus being more than simply 
the absence of positive affect. 

Regarding situational interest (Hidi & Renninger, 2006), bore- 
dom can arise from a lack of interest but is not identical with it 
(Pekrun et al., 2010). Situational interest, as well as lack of 
interest, can relate to various positive and negative emotions (e.g., 
enjoyment, disgust) but are distinct from any specific single emotion
(for a review, see Ainley & Hidi, in press). As such, lack of
situational interest need not be combined with boredom. For ex- 
ample, when preparing for an exam, a student may lack interest in 
the material but may feel panic and a fear of failing the exam rather 
than boredom. From a motivational perspective, lack of interest is 
conceptually equivalent to a lack of approach motivation, whereas 
boredom is equivalent to avoidance motivation in wishing to 
escape the situation, which implies that lack of interest and bore- 
dom belong to different categories of affect. 


Previous Research on Boredom and Academic 
Achievement 


In this section, we summarize the available evidence on linkages 
between students’ boredom and their academic achievement. 
Given the dearth of direct evidence on the boredom–achievement
link, we also consider more indirect evidence arising from findings 
on boredom and students’ ability and engagement underlying 
achievement. 


Boredom and Academic Achievement 


Maroldo (1986) and Pekrun et al. (2010) found that college 
students’ boredom correlated negatively with their grade point 
average. Similarly, research with middle and high school students 
also found boredom to correlate negatively with academic achieve- 
ment. Frenzel, Pekrun, and Goetz (2007) reported that fifth to 10th 
graders’ boredom during math classes related negatively to their 
math achievement. In an investigation of domain-specific achieve- 
ment emotions, boredom and academic grades in different school 
subjects correlated negatively in samples of eighth and 11th grad-
ers (Goetz, Frenzel, Pekrun, Hall, & Lüdtke, 2007). An exception
to this pattern of uniformly negative correlations is a study re- 
ported by Larson and Richards (1991). Using an experience sam- 
pling method to assess fifth to ninth graders’ boredom during 
schoolwork, this study found small positive correlations between 
boredom and students’ achievement (rs = .15 and .13 for grade 
point average [GPA] and test scores, respectively). 

Whereas all these studies were cross-sectional, the studies re- 
ported by Ahmed, van der Werf, Kuyper, and Minnaert (2013); 
Pekrun, Elliot, and Maier (2009); and Pekrun et al. (2010, Study 5) 
used longitudinal designs. Ahmed et al. (2013) found that seventh 
graders’ boredom was negatively related to their math achieve- 
ment, and that change in boredom over one school year was 
negatively related to concurrent change of math achievement. In 
Pekrun et al.’s (2009) investigation, undergraduates’ boredom 
arising from studying for a course exam was a negative predictor 
of performance on the exam, and Pekrun et al. (2010) found that 
students’ boredom in a university course negatively predicted their 
end-of-year course performance. These studies suggest that bore- 
dom is a negative predictor of students’ academic performance, but 
they did not examine reverse effects of performance on boredom. 
In sum, students’ boredom experienced in academic achievement 
settings has almost uniformly been found to correlate negatively 
with their achievement; however, evidence on reverse effects of 
achievement on boredom, and on reciprocal relations between the 
two constructs over time, is lacking. 


Boredom and Cognitive Ability


Traditionally, it was assumed that boredom is caused by a lack 
of challenge resulting from a combination of high individual 
ability and low task demands (Csikszentmihalyi, 1975). In the 
educational literature, boredom was attributed to gifted children 
dealing with environments tailored to the needs of average-ability 
students (“The bored and disinterested gifted child”; Sisk, 1988, p. 
5; also see Rennert & Berger, 1956). By contrast, the evidence 
from a few survey studies suggests that boredom is more fre- 
quently experienced by low-ability individuals. Roseman (1975) 
found that bored students were overrepresented among middle 
school students having IQ scores less than 95. Similarly, Fogelman 
(1976) showed that 11-year-olds who reported being “often bored”
in their spare time had significantly lower cognitive abilities than 
students who were “sometimes bored” or “always enjoyed” their 
leisure time. Congruent with lower ability scores, bored students in 
middle school also reported lower self-concepts of ability (Goetz, 
Pekrun, Hall, & Haag, 2006). This evidence suggests that both 
objective and perceived cognitive ability are negatively related to 
students’ boredom. 


Boredom and Achievement Behavior 


A few studies indicate that boredom relates negatively to stu- 
dents’ attention and effort in achievement activities. Farmer and 
Sundberg (1986) reported that undergraduates’ boredom proneness 
correlated negatively with their attentiveness during lectures. Sim- 
ilarly, Mann and Robinson (2009) found that university students’ 
boredom related to their off-task behavior during lectures and to 
missing future lectures. Watt and Vodanovich (1999) demon- 
strated that college students’ boredom related negatively to their 
educational involvement and career planning. Pekrun et al. (2002, 
2010) reported that university students’ boredom was negatively 
related to their attention, effort, self-regulation of learning, and use 
of flexible learning strategies. 

Similarly, boredom relates negatively to academic engagement 
in middle and high school students. Based on interviews with sixth 
and seventh grade students, Jarvis and Seifert (2002) found that 
students withdrew effort at school because of being bored. In 
Roseman’s (1975) investigation, students’ boredom related nega- 
tively to teacher and parent ratings of how hard students worked. 
Skinner, Furrer, Marchand, and Kindermann (2008) reported that 
fourth to seventh graders’ boredom correlated negatively with their 
behavioral engagement. Finally, Ahmed et al. (2013) found that 
seventh grade students’ boredom in mathematics related nega- 
tively to their use of learning strategies. Overall, the findings 
indicate that boredom relates negatively to students’ attention, 
investment of effort, and self-regulation of learning. 


Summary of Previous Research 


In sum, boredom has been found to correlate negatively with 
students’ academic achievement as well as related variables in- 
cluding cognitive ability, academic self-concepts, and behavioral 
engagement. By contrast, the extant research does not support the 
classical notion that high-achieving students suffer more from 
boredom in achievement settings, compared with average- 
achieving or low-achieving students. Rather, the evidence suggests 
that boredom is linked to low achievement. 


However, as most of the available evidence is cross-sectional, 
causal conclusions are not warranted. Correlations between bore- 
dom and achievement leave unanswered the question of whether 
boredom reduces achievement, low achievement causes boredom, 
or both are correlated due to third variables. For examining recip- 
rocal effects, panel designs would be needed that measure both 
boredom and performance at multiple points of time, thus making 
it possible to analyze effects of one variable on the other while 
controlling for previous levels of both variables (McArdle, 2009). 
No analysis of this type is available to date. As yet, there is no 
empirical answer to address how boredom and achievement recip- 
rocally influence each other, and how these influences unfold over 
time. 
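
The panel logic described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the analysis used in the present research: the two-wave design, the variable names, the coefficient values (0.5 for stability and -0.2 for both cross-lagged paths), and the use of ordinary least squares on manifest scores instead of a latent-variable model are all hypothetical choices made for illustration.

```python
import random


def ols_two_predictors(y, x1, x2):
    """Slopes (b1, b2) of the least-squares regression y ~ 1 + x1 + x2."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # determinant of the 2x2 normal equations
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def simulate_two_waves(n=5000, seed=0):
    """Hypothetical data: each wave-2 score depends on BOTH wave-1 scores."""
    rng = random.Random(seed)
    waves = {"bored1": [], "ach1": [], "bored2": [], "ach2": []}
    for _ in range(n):
        b1 = rng.gauss(0, 1)
        a1 = -0.3 * b1 + rng.gauss(0, 1)            # correlated at wave 1
        b2 = 0.5 * b1 - 0.2 * a1 + rng.gauss(0, 1)  # achievement -> boredom
        a2 = 0.5 * a1 - 0.2 * b1 + rng.gauss(0, 1)  # boredom -> achievement
        for key, value in zip(waves, (b1, a1, b2, a2)):
            waves[key].append(value)
    return waves


w = simulate_two_waves()
# Each cross-lagged slope is estimated while controlling for the earlier
# level of the outcome itself (the autoregressive, or stability, path).
stab_bored, ach_to_bored = ols_two_predictors(w["bored2"], w["bored1"], w["ach1"])
stab_ach, bored_to_ach = ols_two_predictors(w["ach2"], w["ach1"], w["bored1"])
print(f"achievement(t1) -> boredom(t2): {ach_to_bored:.2f}")
print(f"boredom(t1) -> achievement(t2): {bored_to_ach:.2f}")
```

A cross-sectional correlation at wave 1 could not separate the two directions; with both variables measured twice, each direction gets its own coefficient.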


Theoretical Framework: A Reciprocal Effects Model 
of Boredom and Achievement 


Hypotheses regarding linkages between boredom and achieve- 
ment were derived from Pekrun’s (2006; Pekrun & Perry, 2013) 
control-value theory of achievement emotions. This theory inte- 
grates propositions from expectancy-value, attributional, and con- 
trol approaches to achievement emotions (Folkman & Lazarus, 
1985; Pekrun, 1992; Turner & Schallert, 2001; Weiner, 1985). It 
expands upon these approaches by addressing not only outcome 
emotions, but activity emotions such as boredom as well. The 
theory posits that achievement emotions are aroused by cognitive 
appraisals of control over, and the subjective value of, achieve- 
ment activities and their outcomes. Control appraisals consist of 
perceptions of one’s ability to successfully perform actions (i.e.,
self-efficacy expectations) and to attain outcomes (outcome ex- 
pectations). Value appraisals pertain to the perceived importance 
of these activities and outcomes. Furthermore, the theory posits 
that these emotions in turn influence achievement behavior and 
performance. Since performance outcomes shape succeeding per- 
ceptions of control over performance, one important implication is 
that emotions, their appraisal antecedents, and their performance 
outcomes are linked by reciprocal causation. In terms of reciprocal 
causation, the theory is consistent with reciprocal effects models 
for variables such as students’ self-concepts (Marsh & Craven, 
2006; Marsh, Trautwein, Lüdtke, Köller, & Baumert, 2005),
achievement goals (Linnenbrink & Pintrich, 2002), interest (Har- 
ackiewicz, Durik, Barron, Linnenbrink-Garcia, & Tauer, 2008), 
and anxiety (Pekrun, 1992). 


Effects of Boredom on Achievement 


The control-value theory proposes that achievement emotions 
influence students’ cognitive resources, motivation to learn, and 
use of learning strategies. Boredom is expected to reduce cognitive 
resources by producing task-irrelevant thinking (e.g., daydream- 
ing) and increasing distractibility. Furthermore, boredom is pro- 
posed to induce motivation to escape from the achievement set- 
tings that cause boredom, thereby reducing students’ motivation to 
learn. Finally, boredom is posited to impair the use of learning 
strategies and to promote superficial information processing in- 
stead. Given these negative effects, boredom is expected to uni- 
formly impair students’ achievement. 


Reverse Effects of Achievement on Boredom


Achievement reciprocally influences the appraisals that are 
thought to be proximal antecedents of boredom. More specifically, 
the control-value theory proposes that boredom is instigated when 
perceived control over achievement activities and the perceived 
value of these activities are too low (for empirical evidence, see 
Bieg, Goetz, & Hubbard, 2013; Pekrun et al., 2010). For example, 
if a student does not understand the contents of a lecture and 
perceives them as lacking relevance, perceived control and value 
are low, and boredom is expected to follow. Alternatively, bore- 
dom can be induced if control is too high, as implied by tasks 
involving low demands (e.g., repetitive monitoring tasks; Fisher, 
1993). This second possibility is congruent with Csikszentmihalyi’s
(1975) view that boredom is caused when one’s competences are 
high relative to task demands, which would imply high control 
over task performance. 

Combining these propositions amounts to assuming a U-shaped 
curvilinear relationship between control and boredom, with bore- 
dom being promoted by either very low or very high control (i.e., 
a mismatch between demands and capabilities). However, some 
academic environments, such as the environments encountered by 
1st-year students at university, pose significant challenges, making
it likely that the high levels of perceived control that would 
promote boredom are rarely achieved in these environments. Ac- 
cordingly, previous research has found that relationships between 
university students’ perceived control and boredom were nega- 
tively linear and did not contain any curvilinear components 
(Pekrun et al., 2010). For the purposes of the present research, we 
therefore expect the relationships between students’ perceived 
control and their boredom to be negatively linear rather than 
curvilinear. 
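
The contrast between the curvilinear and the linear expectation can be stated as a regression equation. This is an illustrative sketch: the symbols B (boredom), C (perceived control), and the coefficients \beta_0, \beta_1, and \beta_2 are assumed notation, not taken from the original analyses.

```latex
\begin{align*}
  B &= \beta_0 + \beta_1 C + \beta_2 C^2 + \varepsilon \\
  \text{U-shaped relation:} &\quad \beta_2 > 0,\ \text{with boredom lowest at } C^{*} = -\frac{\beta_1}{2\beta_2} \\
  \text{negative linear relation:} &\quad \beta_2 = 0,\ \beta_1 < 0
\end{align*}
```

If observed control scores all lie below C^{*}, only the descending branch of the curve is sampled, so a negative linear trend can dominate the data even when the full relation is U-shaped. The linear hypothesis stated above corresponds to the simpler case \beta_2 = 0.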

Perceived control over achievement activities depends on stu- 
dents’ individual achievement history, with success strengthening 
control and failure undermining it. Hence, achievement is expected 
to have positive effects on perceived control. By implication, since 
achievement has positive effects on control and control has neg- 
ative effects on boredom, it follows that students’ achievement 
should have negative effects on their academic boredom. In line 
with the proposed linear nature of the control–boredom link, we
expect the effects of achievement on boredom to be linear as well. 


Feedback Loops of Boredom and Achievement 
Over Time 


Because boredom is posited to influence achievement and 
achievement, in turn, to influence boredom, the two constructs are 
thought to be linked by reciprocal causation over time. Both effects 
are expected to be negative, amounting to positive feedback loops. 
This proposition implies that boredom, as a deactivating emotional 
experience, is different from activating negative emotions such as 
anxiety which may well be characterized by negative feedback 
loops in some individuals (e.g., failure on an exam instigating 
anxiety, and anxiety eliciting effort to avoid failing the next exam; 
Pekrun, 1992). 

Feedback loops involving reduced achievement and increased 
boredom beg the question: What comes first, boredom or achieve- 
ment? Given that achievement emotions originate early in the 
preschool years, this question appears to represent a chicken-and-egg 
problem for later developmental phases. However, when entering 
a novel academic environment, boredom may develop first, 
triggered by diminished perceived control resulting from difficul- 
ties in understanding course material or by a lack of interest. These 
initial boredom experiences jeopardize students’ ongoing learning 
behaviors, thereby negatively impacting academic achievement 
over time. Achievement is typically assessed later, several weeks 
or months into the semester. Achievement feedback then influ- 
ences subsequent boredom, and feedback loops of boredom and 
achievement can continue to unfold across subsequent phases of 
studying and testing achievement. 


Overview of the Present Research 


We tested the proposed reciprocal effects model using a longi- 
tudinal investigation of university students’ course-related bore- 
dom and achievement. The sample included undergraduate stu- 
dents enrolled in introductory psychology courses spanning an 
entire academic year (two semesters). As noted, for testing models 
of reciprocal causal linkages, designs are needed that assess both 
variables at multiple points in time, either concurrently or in 
alternating order (Little, Preacher, Selig, & Card, 2007; McArdle, 
2009; Rosel & Plewis, 2008). Therefore, the study used a fine- 
grained design that included five assessments of boredom across 
the academic year, as well as five assessments of performance on 
course exams following each boredom assessment. Keeping the 
design in line with our proposition that boredom can initiate 
feedback loops with achievement by developing early in a new 
environment and affecting subsequent achievement, the first as- 
sessment of boredom preceded the first exam. This study design 
made it possible to conduct multiple tests for the effects of bore- 
dom on subsequent performance, and of performance on subse- 
quent boredom, while controlling for prior boredom and achieve- 
ment levels. 

Structural equation modeling was used to competitively test the 
reciprocal effects model against alternative models, including a 
unidirectional model only including effects of boredom on 
achievement (boredom effects model), a unidirectional model only 
including effects of achievement on boredom (achievement effects 
model), and an autoregressive model that did not contain any 
directional effects between boredom and achievement. To ensure 
that any observed relations were not mere artifacts of other plau- 
sible variables, we controlled for demographic and academic back- 
ground variables (gender, age, and high school achievement) in 
each of the models. In a supplemental analysis, we additionally 
controlled for students’ interest and intrinsic motivation to exam- 
ine if the effects linking boredom and achievement were robust 
when including these related variables. 


Method 


Participants and Procedure 


Three weeks into the academic year, 424 students were recruited 
from a two-semester introductory psychology course at a Canadian 
research-intensive university for a web-based study in exchange 
for experimental credit. In the initial sample, 66% were female, 
79% reported English as their first language, and 89% were under 
25 years of age (M = 20.46 years, SD = 4.14). Most participants 
were in their 1st year of study (69%), and the average grade for 
students’ final year of high school was 81%. 

Students were required to complete a web-based questionnaire 
at five points throughout the first and second semesters. The first 
questionnaire was completed during the 3rd and 4th weeks of 
classes (Time 1 questionnaire). The remaining four questionnaires 
were completed within 10 days after each of students’ next four 
test results were posted (Times 2-5 questionnaires). One exception 
was the Time 4 questionnaire that was completed during the first 
10 days of the second semester due to the results of Test 3 being 
posted during the winter break between the two semesters. Web 
survey access was restricted to campus computing facilities to 
prevent distraction and allow for access to technical support staff, 
and was available only during the five time periods specified. 
Study reminders were provided through in-class announcements, 
e-mail updates, and printed notices displayed beside the posted 
course grades. 

Some attrition occurred from one phase to the next due to 
students having already completed their experimental credit re- 
quirements, withdrawing from the course, or illness. However, the 
extent of the attrition observed was minimal: 4% from Times 1 to 
2, 5% from Times 2 to 3, 6% from Times 3 to 4, and 1% from 
Times 4 to 5; total attrition rate was 16%. These attrition rates 
show considerable engagement in the web-based study protocol 
and are below those observed in similar pencil-and-paper studies 
(e.g., 21%: Hall, Perry, Chipperfield, Clifton, & Haynes, 2006; 
20%: Perry, Hladkyj, Pekrun, & Pelletier, 2001). 

A regression analysis on a continuous measure of study 
attrition was conducted (number of boredom assessments not 
completed; M = 0.43, SD = 1.07, range = 0-4). Predictors 
included gender, age, high-school grades, initial course performance 
(Test 1), and Time 1 boredom. Results showed that students 
with better initial performance missed fewer assessments 
(β = -.27, p < .01). However, the proportion of variance 
in study attrition explained by the predictors was small (R² = 
.10). 


Study Measures 


The measures included a self-report scale of boredom, ob- 
jective test performance, demographic background variables 
(gender, age, and high-school grades), and affective back- 
ground variables (interest and intrinsic motivation). Means, 
standard deviations, and ranges for all measures in each study 
phase are outlined in Table 1. 

Boredom. A six-item version of the learning-related boredom 
scale of the Achievement Emotions Questionnaire (AEQ; Pekrun 
et al., 2011) was used to assess students’ learning-related boredom 
concerning their psychology course (e.g., “When studying for this 
course, I feel bored”; 1 = not at all true, 5 = completely true; see 
Appendix A for the scale items). The measure showed high internal 
reliability (αs = .88, .89, .90, .91, and .92, for Times 1, 2, 3, 
4, and 5, respectively). 
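
The internal reliability coefficients reported above are Cronbach's alpha values. As a minimal sketch of how such a coefficient can be computed from raw item responses, the function below applies the standard formula, alpha = k / (k - 1) * (1 - sum of item variances / variance of the sum score). The function name and the small response matrix are hypothetical illustrations and are not data from the study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    # Sample variance of each item across respondents
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the total (sum) score
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of four students to a six-item scale (ratings 1-5)
example = [
    [1, 2, 1, 2, 1, 2],
    [3, 3, 2, 3, 3, 3],
    [4, 4, 5, 4, 4, 5],
    [2, 1, 2, 2, 1, 1],
]
alpha = cronbach_alpha(example)
print(f"alpha = {alpha:.2f}")
```

Values near 1 indicate that the items covary strongly relative to their individual variances, which is the sense in which the boredom scale is described as internally reliable.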

Achievement. Students’ grade percentages on the five tests in 
introductory psychology that followed the boredom assessments 
were obtained from course instructors throughout the academic 
year. Each exam was criterion-referenced, of equal weight, non- 
cumulative in content, and involved a multiple-choice format with 
equivalent numbers of items. The tests were administered approx- 
imately one month apart. 

Covariates. 

Demographic background variables. Three demographic 
background variables were included in the analyses—namely, 
gender, age, and high-school grades. High-school grades consisted 
of students’ average final grade, computed as a percentage, in 
university pre-requisite courses (i.e., English, mathematics, chem- 
istry, physics) completed during their final year of high school. 
Since Scholastic Aptitude Tests (SATs) are not administered to 
Canadian students, high-school grades were used as a proxy for 
pre-existing aptitude differences, based on research showing high- 
school achievement to strongly predict college performance (e.g., 
Hoffman, 2002; Zheng, Saunders, Shelley, & Whalen, 2002). 

Interest and intrinsic motivation. Measures of interest and 
intrinsic motivation were available in the Time 1 questionnaire and 





Table 1 
Descriptive Statistics for the Study Variables 

Variable                   M            SD           Possible range   Observed range 
Boredom 
  Time 1                   11.40        4.21         6-30             6-30 
  Time 2                   11.38        4.17         6-30             6-29 
  Time 3                   [illegible]  4.62         6-30             6-30 
  Time 4                   [illegible]  [illegible]  6-30             6-30 
  Time 5                   12.70        [illegible]  6-30             6-30 
Performance 
  Test 1                   78.26        13.14        0-100            22.81-100 
  Test 2                   73.54        14.12        0-100            27.78-100 
  Test 3                   72.90        14.01        0-100            18.46-100 
  Test 4                   74.44        14.54        0-100            13.20-100 
  Test 5                   73.98        14.23        0-100            26.09-100 
Demographic background variables 
  Age                      20.46        4.14                          17-45 
  High-school grades       81.06        8.70         0-99             55-98 
Affective background variables 
  Interest                 8.38         [illegible]  1-10             2-10 
  Intrinsic motivation     24.00        4.33         5-35             9-33 

used to control for these constructs in the supplemental analysis. 
Students’ interest in the course was measured using one self-report 
item (“I think that what we learn in my Introductory Psychology 
course is interesting”; 1 = strongly disagree, 10 = strongly 
agree). Intrinsic motivation was measured based on the intrinsic 
goal orientation scale from the Motivated Strategies for Learning 
Questionnaire (MSLQ; Pintrich, Smith, Garcia, & McKeachie, 
1991; five items, e.g., “I prefer course material that arouses my 
curiosity, even if it is difficult to learn”; 1 = not at all true of me, 
7 = very true of me; α = .71). 


Rationale for Structural Equation Modeling 


Structural equation modeling (Mplus, Version 7; Muthén & 
Muthén, 2012) was used to evaluate the reciprocal effects model 
and test it against more constrained models. The model represents 
a sequential analysis of reciprocal effects consistent with the 
sequential manner in which the measures were assessed (for a 
similar procedure, see, e.g., Marsh & O’Mara, 2008). In contrast to 
a traditional cross-lagged model in which variables are assessed 
simultaneously within each measurement occasion, boredom and 
performance were modeled in alternating order consistent with the 
data collection process (see Figure 1). As such, the present model 
includes five paths from boredom to performance and four paths 
from performance to boredom. The five boredom variables were 
modeled as latent constructs. The five test performance measures 
and the three background measures (gender, age, and high-school 
grades) were evaluated as manifest variables. The background 
variables were included as covariates; for each of these variables, 
directional paths to all of the boredom and performance variables 
were included, as were correlations between the background vari- 
ables. 
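
The alternating order of assessments described above can be made concrete by enumerating the directional paths it implies. The following sketch is only an illustration of that structure: the function name and the variable labels (Boredom1, Test1, and so on) are hypothetical, and the code does not estimate any model.

```python
def alternating_paths(n_waves=5):
    """List the paths for a design in which boredom wave n precedes test n."""
    # Each boredom assessment predicts the test that follows it
    boredom_to_test = [(f"Boredom{t}", f"Test{t}") for t in range(1, n_waves + 1)]
    # Each test (except the last) predicts the next boredom assessment
    test_to_boredom = [(f"Test{t}", f"Boredom{t + 1}") for t in range(1, n_waves)]
    # Autoregressive paths link each variable to its own next assessment
    autoregressive = (
        [(f"Boredom{t}", f"Boredom{t + 1}") for t in range(1, n_waves)]
        + [(f"Test{t}", f"Test{t + 1}") for t in range(1, n_waves)]
    )
    return boredom_to_test, test_to_boredom, autoregressive


b2t, t2b, ar = alternating_paths()
for source, target in b2t + t2b:
    print(f"{source} -> {target}")
```

With five waves, the sketch yields five boredom-to-performance paths and four performance-to-boredom paths, matching the counts given in the text.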

Measurement model for boredom. The six boredom scale 
items were used as indicators for each of the five latent boredom 
variables. Following recommendations by Pekrun et al. (2011), a 
correlated uniquenesses approach was used to model boredom 
within each boredom assessment by including correlations be- 
tween residuals for items representing the same emotion compo- 
nent (Items 1 and 2, 3 and 4, and 5 and 6 for the affective, 
cognitive, and physiological-motivational components of boredom, 
respectively). In addition, correlations between residuals for 
the boredom items across measurement occasions were included to 
control for systematic measurement error. 

Hierarchical data structure, estimator used, and missing 
values. The university course from which the study sample was 
drawn consisted of five sections taught by different instructors. As 
students were nested in these sections, we corrected for the clus- 
tering of the data using the “type = complex” option implemented 
in Mplus (Muthén & Muthén, 2012). To estimate the model 
parameters, the robust maximum likelihood estimator (MLR) was 
employed, which is robust to nonnormality of the observed vari- 
ables. In order to make full use of the data from students who had 
missing data, we applied the full information maximum likelihood 
method (FIML; Enders, 2006) implemented in Mplus. 

Sequential testing of the reciprocal effects model. The re- 
ciprocal effects model was tested in a sequential manner. We first 
tested the measurement invariance of the boredom measure across 
time by comparing two measurement models, an unconstrained 
baseline model and a strict factorial invariance model that con- 
strained factor loadings, item intercepts, and item residuals to be 
equal across the five measurement occasions (Brown, 2006). Sub- 
sequently, we tested the reciprocal effects model (Model 1; Figure 
1) against three alternative models (see Figure 2; McArdle, 2009): 
a boredom effects model that estimated effects of boredom on 
subsequent achievement but constrained the effects of achieve- 
ment on boredom to be zero (Model 2); an achievement effects 
model that estimated the effects of achievement on boredom but 
constrained the effects of boredom on achievement to be zero 
(Model 3); and an autoregressive model that constrained all effects 
of boredom on performance, and vice versa, to be zero (Model 4). 
Autoregressive effects for boredom and achievement were in- 
cluded in all of these models, and the background variables were 
included as covariates. Finally, we conducted a supplemental 
analysis of the reciprocal effects model that additionally included 
interest and intrinsic motivation as covariates (Model 5). In this 
model, interest was evaluated as a manifest variable, and intrinsic 
motivation as a latent variable using the five items of the intrinsic 
motivation scale as manifest indicators. 

Figure 1. Results for Model 1 (reciprocal effects model). For the covariates, only significant paths are 
displayed (see Table 3 for more complete information). Gender was coded male = 1, female = 2. *p < .05. **p < 

Figure 2. Structure of Models 2-4. (a) Model 2 (boredom effects model). (b) Model 3 (achievement effects 
model). (c) Model 4 (autoregressive model). 


Goodness-of-fit indexes to evaluate model fit. We applied 
both absolute and incremental fit indices to evaluate the fit of the 
models, including the comparative fit index (CFI), the Tucker-Lewis 
index (TLI), the root-mean-square error of approximation 
(RMSEA), and the standardized root-mean-square residual 
(SRMR). Traditionally, values of CFI and TLI close to .95, values 
of RMSEA lower than .06, and values of SRMR lower than .08 
have been interpreted as indicating good fit (Browne & Cudeck, 
1993; Hu & Bentler, 1998; MacCallum, Browne, & Sugawara, 
1996). Following the rationale used by Trautwein et al. (2012), we 
considered a model to have reasonably good fit to the observed 
data when at least two of these criteria were fulfilled. However, it 
should be noted that these recommended cutoff values were orig- 
inally derived from analyses with relatively simple simulated data 
sets. These values are often not met with data sets derived from 
more complex studies, suggesting that they should be used with 
caution (Heene, Hilbert, Draxler, Ziegler, & Bühner, 2011; Marsh, 
Hau, & Wen, 2004). 
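
The decision rule described above (at least two criteria fulfilled) can be sketched as a small helper. Two points are assumptions made only for this illustration: "close to .95" is simplified to a value of at least .95, and CFI and TLI are counted as separate criteria. The function names and the example fit values are hypothetical.

```python
def count_fit_criteria(cfi, tli, rmsea, srmr, incremental_cutoff=0.95):
    """Count how many conventional cutoffs a model meets.

    Assumption: "close to .95" is treated as >= incremental_cutoff.
    """
    checks = {
        "CFI": cfi >= incremental_cutoff,
        "TLI": tli >= incremental_cutoff,
        "RMSEA": rmsea < 0.06,
        "SRMR": srmr < 0.08,
    }
    return sum(checks.values()), checks


def reasonably_good_fit(cfi, tli, rmsea, srmr):
    """Apply the rule that at least two criteria must be fulfilled."""
    met, _ = count_fit_criteria(cfi, tli, rmsea, srmr)
    return met >= 2


# Hypothetical fit values, chosen only to illustrate the rule
print(reasonably_good_fit(cfi=0.93, tli=0.91, rmsea=0.05, srmr=0.06))  # True: RMSEA and SRMR pass
```

A looser reading of "close to .95" would let more models pass on CFI and TLI, so the cutoff is exposed as a parameter rather than fixed.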

For comparing nested models (i.e., Models 1-4 and the unconstrained 
vs. factorial invariance models for the latent boredom 
variable), we used the Satorra-Bentler scaled chi-square difference 
test including scaling corrections for nestedness (Bryant & Satorra, 
2012; Satorra, 2000), which is suited for use with the MLR 
estimator. This test provides the chi-square difference statistic TRd 
that is corrected for nonnormality of the observed variables and 
nestedness of the models. In addition, we evaluated the Akaike 
information criterion (AIC; Akaike, 1974) and the sample-size 
corrected Bayesian information criterion (BIC; Schwarz, 1978). 
Lower values of these criteria indicate better model fit. Using these 
criteria, the model with the smallest AIC and BIC should be 
chosen. For interpreting values of AIC and BIC, it is important to 
note that it is not the absolute size of the values but the difference 
between values which is relevant. For AIC, differences of ΔAIC > 
10 are considered as substantial and indicating that the model 
obtaining the higher value has essentially no empirical support 
(Burnham & Anderson, 2002). 
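
As a worked illustration of this comparison, the sketch below computes each model's AIC difference from the best-fitting (lowest-AIC) model. The AIC values are those reported for Models 1-4 in the Results section; the function name and dictionary keys are illustrative assumptions.

```python
def delta_aic(aic_by_model):
    """Return each model's AIC difference from the lowest-AIC model."""
    best = min(aic_by_model.values())
    return {name: round(aic - best, 2) for name, aic in aic_by_model.items()}


# AIC values as reported for Models 1-4 in the Results section
aic = {
    "reciprocal": 43843.38,
    "boredom_effects": 43879.20,
    "achievement_effects": 43881.34,
    "autoregressive": 43925.78,
}
deltas = delta_aic(aic)

# Differences above 10 indicate essentially no empirical support
# for the model with the higher value (Burnham & Anderson, 2002)
unsupported = [name for name, d in deltas.items() if d > 10]
print(deltas)
print(unsupported)
```

Because only differences matter, subtracting the minimum puts the preferred model at zero and makes the remaining values directly comparable to the threshold of 10.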


Results 


Preliminary Analyses 


Correlations. Correlations between boredom, interest, intrin- 
sic motivation, performance, and the background variables are 
outlined in Table 2. Correlations between the boredom measures (r 
range = .59-.81) and between the performance outcomes (r 
range = .56-.71) over time indicated a substantial degree of 
stability for both variables, with the highest correlations found 
between adjacent assessments. Furthermore, consistently negative 
correlations between boredom and performance were observed (r 
range = -.22 to -.36). Interest and intrinsic motivation correlated 
negatively with boredom, and interest correlated positively with 
performance on Test 1 and Test 2. Concerning the academic 
background variables, high-school grades were positively corre- 
lated with course performance (r range = .33-.41). 

Table 2 
Pearson Product Moment Correlations for the Study Variables 

Mean level change and gender effects on boredom. A 
repeated-measures ANOVA was conducted to assess boredom as a 
function of time and gender throughout the academic year. A 
significant main effect of time was found, F(4, 1216) = 16.19, p < 
.01, ηp² = .05, showing boredom levels to increase over time 
(means are provided in Table 1). Gender also had a main effect on 
boredom, with males (M = 13.29, SD = 4.06) reporting greater 
boredom overall relative to females (M = 11.40, SD = 4.11), F(1, 
304) = 13.58, p < .01, ηp² = .04. 

Test of linearity for the achievement—boredom link. The 
propositions of the control-value theory (Pekrun, 2006) imply that 
the effects of achievement on boredom can take curvilinear forms, 
as noted earlier. For academic settings at university, however, we 
expected that these effects would be negatively linear. To test for 
linearity, we performed simultaneous multiple regression analysis 
for the links between test performance and subsequent boredom 
scores. The regression equations included linear and quadratic 
terms for performance which were computed after centering the 
performance variable. This was done separately for each test 
assessment that was followed by a boredom assessment, thus 
involving four analyses (for the effects of Tests 1, 2, 3, and 4 on 
Boredom Times 2, 3, 4, and 5, respectively). Across measurement 
occasions, test performance had a significant linear effect on 
subsequent boredom (βs = -.33, -.31, -.34, and -.42 for the 
effects of Tests 1, 2, 3, and 4 on subsequent boredom; all ps < 
.001). 

There were no significant effects for the quadratic term in any of 
the equations, with one exception. In the Test 4/Boredom Time 5 
analysis, the effect of the quadratic term was significant (β = -.15, 
p < .01; unstandardized B = -.002). However, this effect was 
small relative to the effect of the linear term. An inspection of the 
regression function showed that the effect implied a decrease of 
the negative slope of the regression curve with increased perfor- 
mance (i.e., a slowing down of the decrease of boredom). There 
were no positive effects of performance on boredom at any interval 
of the regression curve. As such, the Test 4/Boredom Time 5 
regression function was monotonically negative and nearly linear. 
Overall, these findings show that achievement had negative effects 
on subsequent boredom that were linear, or approximately linear, 
for each of the study phases, corroborating our hypothesis that the 
effects of university students’ achievement on their boredom are 
simply negative and do not follow a U-shaped function. 
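
The linearity test described above can be sketched as an ordinary least squares fit with a centered linear term and its square, followed by a check that the fitted slope stays negative across the observed range. This is a minimal illustration under stated assumptions: the function names and the example data are hypothetical, and the sketch omits the standard errors and significance tests that the reported analysis relied on.

```python
def fit_quadratic(x, y):
    """OLS fit of y = b0 + b1*xc + b2*xc**2, with xc = x centered at its mean."""
    n = len(x)
    mean_x = sum(x) / n
    xc = [xi - mean_x for xi in x]
    rows = [[1.0, c, c * c] for c in xc]
    # Normal equations (X'X) b = X'y as an augmented 3x4 matrix
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    b0, b1, b2 = (a[i][3] / a[i][i] for i in range(3))
    return b0, b1, b2, xc


def slope_always_negative(b1, b2, xc):
    """The slope b1 + 2*b2*xc is linear in xc, so checking both ends suffices."""
    return all(b1 + 2 * b2 * c < 0 for c in (min(xc), max(xc)))


# Hypothetical test scores (percent) and subsequent boredom scores
performance = [45, 55, 62, 70, 78, 85, 93]
boredom = [16, 15, 13, 12, 12, 10, 9]
b0, b1, b2, xc = fit_quadratic(performance, boredom)
print(b1, b2, slope_always_negative(b1, b2, xc))
```

Centering the predictor before squaring reduces the correlation between the linear and quadratic terms, which is why the text computes both terms after centering.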


Results of Structural Equation Modeling 


Measurement invariance of the boredom scale. Confirmatory 
factor analysis was used to evaluate the measurement equivalence 
of the boredom measure across the five boredom assessments. We 
evaluated the strict factorial invariance model (Brown, 2006; Mer- 
edith, 1993) that provides a strong test of equivalence by con- 
straining factor loadings (metric invariance), item intercepts (sca- 
lar invariance), and item error variances (invariant uniquenesses) 
to be equal across measurement occasions. The model showed a 
good fit to the data, χ²(384) = 559.37, p < .01; CFI = .975; 
TLI = .971; RMSEA = .033; SRMR = .040. The fit of an 
unconstrained baseline model that allowed parameters to vary 
across time was as follows: χ²(320) = 478.85, p < .01; CFI = 
.977; TLI = .969; RMSEA = .034; SRMR = .033. Comparing the 
two models, the Satorra-Bentler scaled chi-square difference test 
including scaling corrections for nestedness was not significant 
(TRd [64] = 81.78, p > .05). Moreover, the AIC and the sample-size 
corrected BIC were lower for the strict factorial invariance 
model (AIC = 22,586.41 and 22,556.75, and BIC = 22,737.72 and 
22,652.72, for the unconstrained and factorial invariance models, 
respectively; ΔAIC = 29.66). This finding suggests that the strict 
factorial invariance model is preferable to the unconstrained model 
when using an information-theoretical perspective. Overall, these 
results clearly indicate that the strict factorial invariance model 
could be accepted, thus documenting that the boredom measure 
exhibited measurement invariance over time. 
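
To make the scaled difference test more concrete, the sketch below implements the commonly used Satorra-Bentler scaled chi-square difference for MLR chi-square values. It is an illustration under stated assumptions, not a reproduction of the reported analysis: the scaling correction factors are not given in the text, so the values used here are hypothetical, and the variant with additional corrections that the authors cite (Bryant & Satorra, 2012; Satorra, 2000) may differ from this basic formula. The result is therefore not expected to match the reported TRd.

```python
def scaled_chi_square_difference(t0, c0, d0, t1, c1, d1):
    """Satorra-Bentler scaled difference test statistic for nested models.

    t0, c0, d0: chi-square, scaling correction factor, and df of the
    nested (more constrained) model; t1, c1, d1: the same quantities for
    the less constrained comparison model.
    """
    # Scaling correction for the difference test
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)
    # Scaled difference statistic and its degrees of freedom
    trd = (t0 * c0 - t1 * c1) / cd
    return trd, d0 - d1


# Chi-square values and df as reported above; the scaling correction
# factors 1.20 and 1.21 are hypothetical placeholders
trd, df_diff = scaled_chi_square_difference(
    t0=559.37, c0=1.20, d0=384,
    t1=478.85, c1=1.21, d1=320,
)
print(f"TRd({df_diff}) = {trd:.2f}")
```

When both scaling factors equal 1, the statistic reduces to the ordinary difference between the two chi-square values.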

Reciprocal effects model (Model 1). The reciprocal effects 
model included effects of boredom on performance and reverse 
effects of performance on boredom across all assessments of 
boredom and performance, as well as autoregressive effects and 
effects of the covariates (gender, age, and high school grades) on 
all of the boredom and performance variables (see Figure 1 and 
Table 3). The fit indices provided good support for this model, 
χ²(548) = 1,136.05, p < .01; CFI = .931; TLI = .912; RMSEA = 
.050; SRMR = .063. The AIC and sample-size corrected BIC 
were 43,843.38 and 44,045.83, respectively. Both boredom and 
performance showed considerable stability over time, with autoregressive 
coefficients for boredom in the β = .74-.84 range, and 
for exam scores in the β = .47-.61 range. Regarding the covariates, 
gender and age negatively predicted boredom at Time 1, and 
high-school grades positively predicted exam performance on 
Tests 1-5. 

Despite considerable stability of the performance variable and 
substantial effects of prior achievement on performance, results 
showed boredom to negatively predict each subsequent performance 
outcome, with the strongest path observed between the 
initial boredom and performance variables (β range = -.10 
to -.23; see Figure 1). Negative paths from each test outcome to 
the subsequent boredom variable were also observed (β 
range = -.10 to -.16), with three out of the four path coefficients 
being significant. The effect of Test 3 on Boredom Time 4 
(β = -.10, ns) represented the influence of the last exam before 
the winter break on boredom assessed after the winter break; the 
non-significance of the effect may have been due to the time lag 
and intervening events during this break. As such, the results 
provide empirical support for the study hypotheses in showing a 
consistent sequence of negative paths from boredom to performance, 
and vice versa, throughout the academic year. 

Table 3 
Standardized Factor Loadings, Path Coefficients, and Residual Variances for Model 1 (Reciprocal Effects Model) 
Columns: Boredom (Times 1-5); Test performance (Tests 1-5) 
Rows: Factor loadings (Items 1-6); Path coefficients (Boredom Time n [a], Boredom Time n - 1 [b], Test n - 1 [c], Gender, Age, High-school grades); Residual variances 
[a] Effects of Boredom Times 1, 2, 3, 4, and 5 on Tests 1, 2, 3, 4, and 5, respectively. 
[b] Effects of Boredom Times 1, 2, 3, and 4 on Boredom Times 2, 3, 4, and 5, respectively. 
[c] Effects of Tests 1, 2, 3, and 4 on Boredom Times 2, 3, 4, and 5 and Tests 2, 3, 4, and 5, respectively. 
*p < .05. **p < .01. 

Comparison with the boredom effects, achievement effects, 
and autoregressive models (Models 2-4). The unidirectional 
boredom effects and achievement effects models had the same 
structure as the reciprocal effects model, but some of the effects 
linking boredom and achievement were constrained to be zero. 
Specifically, the boredom effects model included effects of boredom 
on achievement, and the achievement effects model included 
effects of achievement on boredom, with reverse effects being 
constrained to zero in these models (see Figure 2). Fit indices for 
the boredom effects model were as follows: χ²(552) = 1,189.20, 
p < .01; CFI = .926; TLI = .906; RMSEA = .052; SRMR = 
.096; AIC = 43,879.20; sample-size corrected BIC = 44,078.14. 
Fit indexes for the achievement effects model were as follows: 
χ²(552) = 1,183.90, p < .01; CFI = .926; TLI = .906; RMSEA = 
.052; SRMR = .076; AIC = 43,881.34; sample-size corrected 
BIC = 44,080.28. Both of these models fit the data significantly 
worse than the reciprocal effects model, with TRd (4) = 104.43, 
p < .01, for the boredom effects model, and TRd (4) = 60.73, p < 
.01, for the achievement effects model. In addition, AIC and BIC 
were substantially higher than for the reciprocal effects model 
(ΔAIC = 35.82 and 37.96 for the boredom effects and achievement 
effects models, respectively). These findings clearly indicate 
that the reciprocal effects model is preferable to these two 
unidirectional models. 

In the autoregressive model, all of the effects linking boredom 
and achievement were constrained to be zero. Fit indexes for this 
model were as follows: χ²(556) = 1,239.37, p < .01; CFI = .920; 
TLI = .900; RMSEA = .054; SRMR = .118; AIC = 43,925.78; 
sample-size corrected BIC = 44,121.21. The Satorra-Bentler 
scaled chi-square difference test showed that the model fit the data 
less well than the reciprocal effects model, TRd (8) = 140.68, p < 
.01. In addition, AIC and BIC were substantially higher than for 
the reciprocal effects model (ΔAIC = 82.40), which also suggests 
poorer fit for the autoregressive model. These findings indicate 
that the reciprocal effects model is superior to the autoregressive 
model. In sum, the findings clearly indicate that the reciprocal 
effects model fit the data significantly better, and can be judged to 
be more likely given the data (see Burnham & Anderson, 2002, 
Chapter 2), as compared with any of three alternative models. 

Supplemental analysis: Controlling for interest and intrinsic 
motivation (Model 5). In a supplemental analysis, we expanded 
the reciprocal effects model by additionally controlling for stu- 
dents’ interest and intrinsic motivation assessed at Time 1. By 
testing whether the links between boredom and achievement were 
sufficiently robust when these related variables were included, we 
sought to address a potential concern that the boredom construct 
may simply be regarded as the inverse of interest or intrinsic 
motivation, as discussed at the outset. The fit indexes for this 
model were similar to the indexes for the original Model 1, with 
χ²(759) = 1,532.81, p < .01; CFI = .919; TLI = .900; RMSEA = 
.049; SRMR = .063. Again, gender and age had negative effects 
on boredom at Time 1, and high-school grades had positive effects 
on performance on all of the course exams. Interest also had 
negative effects on boredom. Furthermore, interest had a positive 
effect on performance on Test 1 (see Appendix B for the estimated 
model parameters). 

Of critical importance, the path coefficients for the effects 
linking boredom and achievement replicated the coefficients of the 
original model. Again, all the effects of boredom on achievement 
were significantly negative, with βs ranging from -.09 to -.25. The 
effects of achievement on boredom were negative as well, with βs 
ranging from -.11 to -.15 and three out of the four coefficients 
being significant. Again, there was one non-significant effect (Test 
3 on Boredom Time 4; β = -.11, ns) which pertained to the effect 
of test performance on students’ boredom across the winter break. 
In sum, these results show that the effects of boredom on achieve- 
ment, and the effects of achievement on boredom, were robust 
when controlling for students’ interest in the course and their 
intrinsic motivation. 


Discussion 


The findings of this study provide evidence for the proposed 
reciprocal effects model of boredom and achievement. As sug- 
gested by longitudinal structural equation modeling, university 
students’ course-related boredom had negative effects on their 
performance on subsequent course exams, and exam performance, 
in turn, had negative effects on subsequent boredom. The findings 
were consistent across all assessments of boredom and perfor- 
mance, except the link between the last exam taken prior to the 
winter break and boredom after the break which was negative but 
not significant. Alternative models including only effects of bore- 
dom on performance, or only effects of performance on boredom, 
also showed a reasonable fit to the data in terms of CFI and 
RMSEA; however, as indicated by chi-square difference tests and 
the comparison of AIC values, the reciprocal effects model clearly 
showed better fit than these alternative models. 

Because prior boredom and achievement as well as demo- 
graphic background variables were controlled, the path coeffi- 
cients are likely to represent effects of boredom on achievement, 
and vice versa, rather than simply the influence of prior boredom, 
prior achievement, gender, or age. This was further supported by 
supplemental findings showing the boredom-performance link to 
be robust when additionally controlling for students’ interest and 
intrinsic motivation. 

For interpreting the size of the path coefficients linking boredom 
and performance, it is important to note that the coefficients 
represent incremental effects due to prior boredom and achieve- 
ment being controlled. Thus, the coefficients represent effects of 
each variable on change in the other from one assessment to the 
next, rather than effects on the absolute levels of these variables. 
Furthermore, both boredom and performance showed considerable 
stability over time, leaving little variance to be explained and 
making it difficult to detect the effects of additional variables. 
From this perspective, the consistency of effects lends credibility 
to the notion that boredom and achievement are indeed linked by 
reciprocal causation over time. 


Reciprocal Effects of Boredom and Achievement 


The present findings add to the research literature by document- 
ing that boredom negatively predicts students’ scholastic attain- 
ment over time. This is congruent with previous evidence showing 
that boredom and academic achievement are negatively correlated, 
as summarized at the outset. However, the present findings go 
beyond correlational evidence by disentangling the directional 
effects underlying the boredom-achievement link. Specifically,
the findings indicate that boredom indeed has a negative influence 
on students’ achievement, over and above the effects of prior 
accomplishments. These negative effects are in line with proposi- 
tions derived from Pekrun’s (2006) control-value theory, which 
posits that boredom has uniformly negative effects on learning and 
achievement outcomes. 

The results also contribute to our understanding of the origins of 
students’ boredom. The findings indicate that achievement, in turn, 
had negative effects on boredom, implying that successful com- 
pletion of exams can reduce students’ boredom, whereas doing 
poorly exacerbates their boredom. These effects are likely medi- 
ated by students’ perceptions of control over achievement, with 
low control leading to greater boredom (Pekrun et al., 2010). The 
findings of regression analyses further showed the links between 
achievement and subsequent boredom to be virtually linear in 
nature rather than representing U-shaped curvilinear effects. 

These negative effects of performance on boredom are counter 
to the accepted view that boredom is primarily experienced by 
gifted students who are not sufficiently challenged by academic 
demands. However, they are consistent with our earlier reasoning 
that university courses pose considerable challenges for many 
students, so that even the most capable students must struggle to 
retain perceived control and to master the material, contrary to the 
notion that success at university comes easy, due to boring and 
routine task demands. This may be especially true for 1st-year 
students, who are the focus of the present study. 

Taken together, these negative effects amount to positive feed- 
back loops linking the two constructs. In a few longitudinal stud- 
ies, previous research has found students’ test anxiety and aca- 
demic achievement to be linked by positive feedback loops 
(Meece, Wigfield, & Eccles, 1990; Pekrun, 1992). The present 
research adds to this literature by showing that boredom, an 
underexplored academic emotion, demonstrates similar links with 
performance. As such, it seems that unidirectional models cannot 
adequately capture the complex reality of students’ emotions. 
Rather, systems-oriented perspectives (Turner & Waugh, 2007) 
are needed that take more complex patterns of causal links into 
account, including feedback loops between emotions, their ante- 
cedents, and their effects. 


Effects of Interest and Intrinsic Motivation 


In our supplemental analyses, interest had negative effects on 
students’ boredom. The effects of intrinsic motivation on boredom 
were also negative, although most of them were not significant. 
Although a one-item measure cannot substitute for a more com- 
prehensive assessment of interest, the findings suggest that interest 
can protect against feeling bored, and that lack of interest can 
contribute to the arousal of boredom (Pekrun et al., 2010). In 
addition, interest had a positive effect on initial exam performance. 
The small size of this effect and the lack of positive performance 
effects of intrinsic motivation are in line with previous research
showing that interest and intrinsic motivation typically do not have 
a strong influence on students’ immediate academic performance; 
rather, they influence students’ long-term attainment and academic 
choices (Murayama, Pekrun, Lichtenfeld, & vom Hofe, 2013; 
Schiefele, 2009). To the extent that this is true, boredom and 
interest (or lack of interest) may have asymmetrical effects, with 
boredom immediately impacting learning and interest primarily 
having long-term effects. 


Development of Boredom Over Time 


Students’ boredom was found to increase over the academic 
year. This increase parallels the development of boredom
observed for middle and high school students (Ahmed et al., 2013; 
Pekrun et al., 2007), and with the decline of interest and intrinsic 
motivation that is observed during adolescence after students have 
entered middle school (e.g., Fredricks & Eccles, 2002; Frenzel, 
Goetz, Pekrun, & Watt, 2010). To the extent that there are com- 
mon mechanisms of affective change during young adulthood and 
adolescence, one explanation may be that boredom goes up, and 
enjoyment down, after the initial excitement experienced in a new 
educational environment has dissipated. 

A second possible explanation is provided by our reciprocal 
effects model. The feedback loops predicted by the model imply 
that there should be a symmetrical, self-sustaining development of 
boredom and achievement over time, with initial boredom reduc- 
ing subsequent achievement, and reduced achievement contribut- 
ing to increased boredom. Multiple cycles of this type should lead 
to a steady increase of average boredom scores across time, as 
observed in the present research. They should also lead to a 
reduction of achievement over time; however, the present analysis 
is not suited to examine this prediction, as exam scores at univer- 
sity lack a common metric due to variation in the contents and 
difficulty of exams across the academic year. 
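
The feedback-loop argument can be illustrated with a toy iteration. The sketch below is not the authors' model and uses no study data: the function name `simulate_feedback`, the cross-lagged value of -0.1, the starting values, and the assumption of perfect carry-over (stability of 1) are all hypothetical choices made only to show how two negative cross-lagged effects can compound across waves.

```python
def simulate_feedback(boredom0, achievement0, cross_lag=-0.1, waves=5):
    """Iterate a toy two-variable system with negative cross-lagged effects.

    Each variable carries over fully from wave to wave (stability of 1, a
    simplifying assumption) and is shifted by cross_lag times the other
    variable's value at the previous wave.
    """
    boredom, achievement = [boredom0], [achievement0]
    for _ in range(waves - 1):
        b_prev, a_prev = boredom[-1], achievement[-1]
        boredom.append(b_prev + cross_lag * a_prev)
        achievement.append(a_prev + cross_lag * b_prev)
    return boredom, achievement


# Hypothetical start: boredom one unit above a reference level, achievement at it.
boredom, achievement = simulate_feedback(boredom0=1.0, achievement0=0.0)
for wave, (b, a) in enumerate(zip(boredom, achievement), start=1):
    print(f"wave {wave}: boredom={b:.3f}, achievement={a:.3f}")
```

Under these assumed values, boredom never falls and achievement declines at every wave, which is the qualitative pattern the reciprocal effects model implies. With stabilities below 1, as in real data, the trajectories would be dampened.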


Limitations, Suggestions for Future Research, and 
Implications for Practice 


The present investigation represents a significant advance over 
previous research by documenting reciprocal effects of boredom 
and achievement over time, while controlling for critical affective 
and demographic background variables. Nevertheless, several lim- 
itations should be considered when interpreting the study findings, 
and can be used to suggest directions for future research. 

Methodological considerations. The power of non- 
experimental field studies to derive causal conclusions is limited 
by the nature of their design. Clearly, non-experimental designs 
are less powerful for deriving causal conclusions than experimen- 
tal designs, all other things being equal. As such, although the 
present analysis used multi-wave longitudinal structural equation 
modeling and controlled for related variables and autoregressive 
effects, it cannot be completely ruled out that the study findings 
were due to other variables not included in the study. On the other 
hand, field studies of emotion may have more ecological validity 
than experimental studies that typically are limited in terms of 
situational representativeness (partially due to ethical limits on 
experimentally manipulating emotions): Emotion research is
no exception to the trade-off between internal and external
validity that is typical of scientific inquiry in psychology. 


As such, the power of non-experimental field studies to derive 
conclusions regarding real-world causal processes in emotion may 
be limited due to threats to internal validity, whereas the power of 
experimental studies to derive such conclusions may be limited in 
terms of reduced external validity. By implication, future research 
should further pursue the approach taken herein but should also 
complement this approach with experimental studies on the link 
between boredom and students’ achievement. 

One specific methodological limitation of the present analysis is 
that achievement was modeled as a manifest variable. By using 
exam grades, we sought to employ an ecologically valid measure 
of student achievement. As is typical for grades, information about 
reliability was not available; as such, it was not possible to disattenuate
the boredom-achievement link for potential unreliability
of the achievement measure. From the perspective of hypothesis 
testing, this implies that our study hypotheses were tested in a 
conservative manner. With additional correction for measurement 
error in the achievement variables, the effects of boredom on 
achievement may have been even stronger than in the present 
analysis. 
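
The reasoning behind this conservative-test argument follows from the classical correction for attenuation. The formula below is the standard textbook correction for a correlation, not a computation reported in the study; the symbols (r_{BA} for the observed boredom-achievement correlation, r_{BB} and r_{AA} for the reliabilities of the two measures) are notation assumed here.

```latex
\hat{\rho}_{BA} = \frac{r_{BA}}{\sqrt{r_{BB}\, r_{AA}}}
```

Because reliabilities cannot exceed 1, any r_{AA} < 1 makes the corrected value larger in magnitude than the observed one. Without an estimate of r_{AA} for exam grades, the correction cannot be applied, which is why the observed links are best read as lower-bound estimates.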

However, there may also be an alternative perspective on the 
reliability of grades. As a measure of achievement, grades may 
have less than perfect reliability, implying that any effects of 
boredom on achievement may be underestimated. By contrast, 
from the perspective of grades as sources of students’ affective 
development, they could be seen as having almost perfect reliabil- 
ity, as grades rather than true achievement provide the feedback 
that shapes students’ perceptions of success and failure. 

Substantive issues. The present research examined academic 
boredom as experienced by university students. Compared with the 
general population, university students represent a select group 
having above-average ability and positive achievement experi- 
ences throughout prior stages of the educational career. It is open 
to question whether the present findings would generalize to 
low-ability adults, to younger age groups, and to the general 
student population in K-12 educational institutions. As the present
research involved samples of North American students, it also 
remains an open question as to whether the findings would gen- 
eralize to students in other cultures. 

Regarding the origins of students’ boredom, the present findings 
indicate that poor academic achievement can contribute to the 
arousal of boredom. As such, the findings suggest that poor 
achievement and academic demands that are too challenging can 
trigger boredom in university students. Future research should 
examine under which task conditions, and in which students, 
boredom may also occur due to academic demands that are too low 
and fail to challenge students’ competencies, in line with Csik- 
szentmihalyi’s (1975) suggestions concerning the impact of low 
task demands (also see Daschmann, Goetz, & Stupnisky, 2011; 
Nett, Goetz, & Hall, 2011). To this end, it may be useful to 
examine boredom across various academic contexts and in stu- 
dents representing widely differing levels of ability, including 
gifted students who may more easily experience high levels of 
control over academic demands. 

Finally, the study addressed the overall relation between bore- 
dom and achievement but did not examine the mechanisms that 
may mediate the observed links. In the proposed model of recip- 
rocal effects, it is posited that effects of boredom on achievement 
are due to the detrimental influence of boredom on cognitive 
resources, motivation, strategy use, and self-regulation, and that
the effects of achievement outcomes on boredom are mediated by 
perceptions of control over performance. More research on the link 
between boredom and achievement as mediated by cognitive and 
motivational mechanisms is needed to better understand students’ 
boredom and to inform efforts to remediate its deleterious effects. 

Implications for educational practice. Two important mes- 
sages can be derived from the present research. First, the study 
results suggest that boredom has uniformly negative effects on 
students’ academic achievement, and these effects are not mere 
epiphenomena of prior performance: More likely, they represent a 
true negative causal influence of students’ boredom experiences. 
By implication, the findings suggest that educators, administrators, 
and parents alike may want to consider intensifying efforts that 
minimize students’ boredom, the relative inconspicuousness of this 
emotion notwithstanding. Second, the results imply that achieve- 
ment outcomes reciprocally influence students’ boredom, suggest- 
ing that successful performance attainment and positive achieve- 
ment feedback can contribute to a reduction of students’ boredom, 
whereas failure experiences can increase boredom (also see 
Pekrun, Cusack, Murayama, Elliot, & Thomas, 2014). Accord- 
ingly, providing students with experiences of success (e.g., in 
terms of mastery learning) may help to prevent the development of 
boredom. By documenting the influence of achievement outcomes 
on students’ boredom, the present findings elucidate one important 
factor that can be targeted by educators to reduce negative affect 
and thereby facilitate students’ academic development. 


Appendix A


Items of the Boredom Scale 


1. When studying for this course, I feel bored. 


2. The things I have to do for this course are often boring. 


3. The content is so boring that I often find myself daydreaming. 


4. When studying, my thoughts are everywhere else, except on the course material. 


5. Often I am not motivated to invest effort in this boring course. 


6. The material in this subject area is so boring that it makes me exhausted even to think about it.


Appendix B


Standardized Factor Loadings, Path Coefficients, and Residual Variances for Model 5


[Table body not recoverable from the source; surviving column groups: Boredom, Test performance]


Perfectionism and Motivation of Adolescents in Academic Contexts


Mimi Bong, Arum Hwang, Arum Noh, and Sung-il Kim 


Korea University 


We examined the nature of self-oriented and socially prescribed perfectionism in relation to the 
motivation and achievement of 306 Korean 7th graders. We also tested the mediating role of domain- 
specific academic self-efficacy and achievement goals in the relationships between perfectionism and 
achievement-related outcomes across math and English. In the direct path model, self-oriented perfec- 
tionism related positively to academic achievement and negatively to acceptability of cheating and 
academic procrastination. Socially prescribed perfectionism, in contrast, related positively to test anxiety, 
acceptability of cheating, and academic procrastination. In the mediation models, self-oriented perfec- 
tionism related consistently and positively to academic self-efficacy, a mastery goal, and a performance- 
approach goal in the domain. Socially prescribed perfectionism related consistently and positively to a 
performance-approach and a performance-avoidance goal. Academic self-efficacy and a mastery goal 
mediated the paths from self-oriented perfectionism to acceptability of cheating, academic procrastina- 
tion, and achievement, while a performance-avoidance goal in English mediated the path from socially 
prescribed perfectionism to test anxiety. Many of the paths from perfectionism to outcomes were thus 
mediated by domain-specific motivation. The direct paths from the 2 perfectionism dimensions to 
academic procrastination remained significant, however, even in the presence of the intervening motivation
variables.


Keywords: perfectionism, self-efficacy, achievement goals, anxiety, procrastination 


Perfectionism refers to the personality trait of setting difficult 
goals and evaluating one’s own performance critically against 
these goals (Flett, Hewitt, & Dyck, 1989; Frost & Marten, 1990). 
Its strong associations with diverse symptoms of psychological 
maladjustment and disorders have made it the topic of extensive 
research in the past (Hewitt & Flett, 1991). There is reason to 
suspect, however, that perfectionism is a multidimensional con- 
struct and may not always be a harmful characteristic to possess 
(Frost, Marten, Lahart, & Rosenblate, 1990; Hewitt & Flett, 1991). 
For example, perfectionism influences goal-setting. Goals deter- 
mine the direction of behavior, strength of effort and persistence, 
and quality of final performance (Locke & Latham, 2002). The 
motivation and achievement of perfectionists, who strive for chal- 
lenging goals, would be different from those of nonperfectionists. 

Despite the apparent relevance to achievement striving, only a 
small number of studies to date have tested how perfectionism 
operates in academic settings. Recent evidence demonstrates that 
certain forms of perfectionism could prove beneficial in learning 
situations (see Fletcher & Speirs Neumeister, 2012, for review). 
The literature, however, has yet to offer concrete answers to questions
such as how perfectionism relates to motivation in school,
what the nature of relationships is between perfectionism and 
achievement-related outcomes, and which type of perfectionism is 
actually conducive to learning. The few available studies con- 
ducted in academic contexts have involved college students in 
North America (Mills & Blankstein, 2000; Verner-Filion & Gau- 
dreau, 2010), which further limits generalizability of the findings 
to adolescent populations and other cultures. 

We tried to address these issues by investigating the relation- 
ships between different types of perfectionism and indexes of 
academic motivation and performance in a group of Korean adolescent
students. More broadly, we were interested in the extent to which
domain-specific motivational beliefs mediated the effects of personality
dispositions on achievement-related outcomes. We first
examined the dimensional characteristics of perfectionism in rela- 
tion to test anxiety, acceptability of cheating, academic procrasti- 
nation, and achievement, to explore the nature of multidimensional 
perfectionism in academic contexts. We then tested the role of 
academic self-efficacy and achievement goals as potential media- 
tors in these associations. 


Self-Oriented and Socially Prescribed Perfectionism 


Perfectionism as a Multidimensional Construct 


Frost et al. (1990) claimed that a unidimensional definition of 
perfectionism, as the tendency to set excessively high personal 
standards, cannot distinguish highly competent and successful 
“normal” perfectionists from “neurotic” perfectionists. They viewed 
perfectionism as multidimensional, comprising six correlated
dimensions: high personal standards, a concern over mistakes in 
performance, feelings of doubt about quality of actions, valuing of
parents’ expectations, apprehension of parents’ criticism of per- 
formance, and an overemphasis on organization, neatness, and 
order. All six dimensions, except for the organization dimension, 
correlated positively with fear of failure. While the concern over 
mistakes and doubts about actions dimensions correlated posi- 
tively with maladaptive symptoms such as depression, obsessive- 
compulsive disorder, feelings of guilt, and procrastination, the 
personal standards and organization dimensions did not. The per- 
sonal standards dimension even correlated negatively with pro- 
crastination and positively with self-efficacy. The parental expec- 
tations and parental criticism dimensions are now considered 
antecedents rather than functional dimensions of perfectionism 
(Fletcher, Shim, & Wang, 2012). 

Hewitt and Flett (1991) also viewed perfectionism as a multi- 
dimensional construct but with a different set of dimensions. They 
claimed that there are three dimensions of perfectionism that 
interact with different types of stressors to produce distinct out- 
comes: self-oriented, other-oriented, and socially prescribed per- 
fectionism. Self-oriented perfectionists impose high standards 
upon themselves, evaluate their own performance against these 
standards, and strive to perform flawlessly to meet these standards. 
Other-oriented perfectionists enforce high standards upon others, 
evaluate others’ performance against those standards, and insist 
others perform perfectly to meet those standards. These two di- 
mensions are differentiated from socially prescribed perfectionism 
based on who takes charge of setting the standards. Whereas self- 
and other-oriented perfectionists strive to satisfy, or demand others 
to satisfy, the standards that they generate, socially prescribed 
perfectionists strive to meet the standards that significant others, 
such as parents, impose on them (Stoeber, Feast, & Hayward, 
2009). 

Because the conceptualization of Hewitt and Flett (1991) deals 
with both intraindividual and interpersonal aspects of perfection- 
ism, it appears more pertinent to the study of children and adoles- 
cents than that of Frost et al. (1990). Among the three types of 
perfectionism, other-oriented perfectionism seems least relevant 
because children and adolescents are more often targets of other- 
oriented perfectionism than they are other-oriented perfectionists 
themselves. We were thus only interested in self-oriented and 
socially prescribed perfectionism in this research. 


Perfectionism in East Asian Cultures 


Because the present sample consisted of Korean adolescents, it 
is important to inspect at the outset features in East Asian cultures 
that may render the distinction between self-oriented and socially 
prescribed perfectionism particularly consequential. Collectivism 
is one such feature. Individuals in countries such as Korea, China, 
and Japan tend to embrace interdependent self-construal (Heine, 
2001; Markus & Kitayama, 1991; Oishi & Diener, 2001). They 
strive hard to maintain group harmony by paying keen attention to 
in-group members’ feelings, opinions, and actions, trying to please 
and to avoid displeasing significant others, and conforming to 
established norms and standards. For adolescents in East Asian 
cultures, judgments of success and failure in school would depend 
heavily on what parents, teachers, and society in general deem 
satisfactory. 


Another relevant feature in East Asian cultures is the high 
standards of academic excellence that parents impose on their 
child. In a study by Okagaki and Frensch (1998), Asian parents 
reported significantly higher “expected” as well as “ideal” educa- 
tional attainments for their child than did European American and 
Latino parents. Asian parents also displayed significantly stronger 
negative reactions to the hypothetical scenarios of their child 
receiving grades of B’s and C’s instead of A’s. At the same time, 
children in East Asian cultures have a strong sense of gratitude and 
indebtedness to their parents (Park & Kim, 2006). A strong sense 
of obligation coupled with high parental standards could increase 
socially prescribed perfectionism in Asian students. 

Castro and Rice (2003) reported that Asian American college 
students indeed scored significantly higher on Frost et al.’s (1990) 
perfectionism dimensions of parental criticism, concerns over mis- 
takes, and doubts about actions than did European and African 
American students. They also scored significantly higher on pa- 
rental expectations and personal standards than did European 
American students. The two parental dimensions are strong cor- 
relates of socially prescribed perfectionism (Flett, Sawatzsky, & 
Hewitt, 1995). Furthermore, perfectionism accounted for 27% of 
the variance in Asian American students’ grade-point averages 
(GPAs), compared to only 7% in European American students’ 
GPAs. Socially prescribed perfectionism is hence judged to be a 
particularly meaningful construct to examine in relation to Asian 
students’ school achievement. 


Perfectionism in Academic Contexts 


Perfectionism as a Predictor of Achievement-Related 
Outcomes 


Self-oriented and socially prescribed perfectionism typically 
demonstrate moderate to strong positive correlations with each other.
Even so, socially prescribed perfectionism correlates with a 
broader array of psychological maladjustments than does self- 
oriented perfectionism. In Hewitt et al. (2002), for example, both 
types of perfectionism correlated positively with anxiety and de- 
pression. Socially prescribed perfectionism further correlated pos- 
itively with outward expression of anger and social stress and 
negatively with anger suppression. Based on these results, Hewitt 
et al. concluded that both self-oriented and socially prescribed 
perfectionism make children and adolescents vulnerable to maladjustment,
albeit to differential degrees.

The picture emerging from the academic domain, however, is
somewhat different. Noting the unambiguous bearing of perfec- 
tionism on achievement behavior, several investigators have tried 
to unearth the psychological and behavioral profiles associated 
with each perfectionism dimension in academic settings. Because 
perfectionists’ striving to attain difficult goals often results in 
negative affect and counterproductive behavior (Einstein, Lovi- 
bond, & Gaston, 2000), variables such as anxiety and procrastina- 
tion, along with achievement, have been closely examined in 
relation to perfectionism. Anxiety and procrastination are major 
impediments to successful coping and performance (Steel, 2007; 
Zeidner, 1994). It is important to learn, therefore, if perfectionism 
actually elevates these negative psychological and behavioral re- 
sponses in achievement situations and, if so, why. 

Unfortunately, relationships of the two perfectionism dimensions
with anxiety and procrastination have been less than straightforward.
In a study by Mills and Blankstein (2000), both self-oriented
and socially prescribed perfectionism correlated positively
with test anxiety and extrinsic motivation, consistent with the 
observation of Hewitt et al. (2002). However, self-oriented per- 
fectionism in their study also correlated positively with adaptive 
motivation and learning process variables such as self-efficacy, 
task value, use of various cognitive strategies, and effective re- 
source management. Socially prescribed perfectionism did not 
correlate significantly or correlated negatively with these vari- 
ables. In Einstein et al. (2000), only socially prescribed perfec- 
tionism correlated with anxiety and depression, although both 
perfectionism dimensions correlated positively with stress. 

Procrastination is another variable frequently studied in relation 
to multidimensional perfectionism. Socially prescribed perfection- 
ism correlates with a general tendency of procrastination as well as 
academic procrastination (Flett, Blankstein, Hewitt, & Koledin, 
1992). Negative perfectionism, which is analogous to socially 
prescribed perfectionism, also correlates positively with academic 
procrastination, while positive perfectionism, which is analogous 
to self-oriented perfectionism, does not (Burns, Dittmann, Nguyen, 
& Mitchelson, 2000). Contrary to these findings, a meta-analysis 
by Steel (2007) showed that the perfectionism-procrastination correlation
was negligible (r = -.03). This discrepancy is likely due
to the definition of perfectionism in Steel’s review, which comprised
only self- and other-oriented perfectionism. Socially
prescribed perfectionism was classified as an index of fear of 
failure, which did correlate positively with procrastination (r = 
.18). Further empirical tests will help clarify the relationship 
between perfectionism and procrastination. 

Researchers have also been interested in the relationship be- 
tween perfectionism and cheating for obvious reasons. Cheating 
provides a means to attain an otherwise impossible goal. Earlier, 
we described that Asian parents put high academic demands on 
their child (Okagaki & Frensch, 1998), and Asian students, in turn, 
perceive high parental expectations and parental criticism (Castro 
& Rice, 2003) that are antecedents of socially prescribed perfec- 
tionism (Flett et al., 1995). High parental pressure functions as a 
source of conflict between Korean parents and children, which 
increases acceptability of cheating behavior among Korean ado- 
lescents (Bong, 2008). Still, the results on cheating have not been 
fully consistent, either. Vansteenkiste et al. (2010) found that the 
personal standards dimension of perfectionism correlated nega- 
tively with acceptability of cheating as well as actual cheating 
behavior. The concern over mistakes and doubts about actions 
dimensions correlated with neither of them. Nathanson, Paulhus, 
and Williams (2006), however, failed to find a significant relation- 
ship between self-oriented or socially prescribed perfectionism and 
cheating behavior. 

When it comes to academic performance, being a self-oriented
perfectionist clearly helps. Bieling, Israeli, Smith, and Antony
(2003) reported that adaptive perfectionism, which included self- 
oriented perfectionism, correlated positively with both positive and 
negative affect toward the recent exam, future plans to study more, 
grade goals for the current and future exams, and actual exam 
performance. Maladaptive perfectionism, which included socially 
prescribed perfectionism, also correlated positively with negative 
affect toward the exam. However, unlike adaptive perfectionism, it
correlated negatively with positive affect toward the exam or exam
preparedness. Other studies similarly depict the performance ben- 
efits of self-oriented perfectionism. Stoeber and Rambow (2007) 
observed that students with an adaptive form of perfectionism 
attained significantly higher academic achievement compared to 
those with a maladaptive form of perfectionism. Verner-Filion and 
Gaudreau (2010) also reported that self-oriented perfectionism 
positively predicted academic satisfaction and grade point aver- 
ages for college students, whereas socially prescribed perfection- 
ism negatively predicted them. 

Given the negative effect anxiety has on performance, it is 
puzzling that self-oriented perfectionism, which often correlates 
positively with anxiety, enhances performance. The answer may 
come from the trait-state distinction. When Zeidner (1994) exam- 
ined the relationships between multiple components of trait anxi- 
ety and state anxiety, only social evaluation trait anxiety predicted 
state anxiety before the exam, directly and indirectly via academic 
stress. In Mills and Blankstein’s (2000) study described earlier, 
self-oriented perfectionism no longer correlated with test anxiety, 
when its covariance with socially prescribed perfectionism was 
controlled for. These results suggest that socially prescribed per- 
fectionism increases state anxiety, while self-oriented perfection- 
ism does not, even though both correlate with trait anxiety (Flett, 
Hewitt, Endler, & Tassone, 1994). This conjecture requires a 
mediating process that weakens the link of self-oriented perfec- 
tionism to state anxiety, which we describe in the next section. 

To summarize, self-oriented perfectionism demonstrates posi- 
tive associations with academic achievement and null or negative 
associations with test anxiety, academic procrastination, and ac- 
ceptability of cheating. Socially prescribed perfectionism, on the 
contrary, demonstrates negative associations with achievement and 
positive associations with detrimental indexes in academic set- 
tings. 


Academic Motivation as a Mediator Between 
Perfectionism and Outcomes 


Self-oriented perfectionism, therefore, appears to serve a
more positive than negative function in the learning process. 
However, more evidence is needed to conclude that it is indeed an 
adaptive form of perfectionism for learners in academic contexts. 
In addition, most of the few available studies simply contrasted 
relationships of the two perfectionism dimensions with various 
outcomes without probing why they were associated with different 
outcomes or with the same outcomes in different manners. 

Miquelon, Vallerand, Grouzet, and Cardinal (2005) argued that 
failure to integrate mediating motivational processes in the rela- 
tionships between perfectionism and outcomes has been responsi- 
ble for the ambiguous effects associated with self-oriented perfec- 
tionism. In their study, self-oriented and socially prescribed 
perfectionism for college students correlated positively with each 
other as well as with neuroticism (Study 2). When motivational 
constructs were incorporated as mediators in path analysis, how- 
ever, the two displayed completely different predictive patterns. 
Self-oriented perfectionism positively predicted self-determined 
academic motivation, which in turn positively predicted academic 
adjustment and negatively predicted psychological adjustment difficulties. 
Socially prescribed perfectionism, on the contrary, positively predicted 
non-self-determined academic motivation, which 
in turn positively predicted psychological adjustment difficulties. 

Seo (2008) also tested the mediating role of academic self- 
efficacy in the relationship of self-oriented perfectionism with 
academic procrastination. Self-oriented perfectionism correlated 
positively with self-efficacy and negatively with procrastination. 
More important, academic self-efficacy fully mediated the rela- 
tionship between self-oriented perfectionism and academic pro- 
crastination. These studies corroborate the adaptive nature of self- 
oriented perfectionism and illustrate the mediating role academic 
motivation plays in the perfectionism-outcome links. 


Perfectionism and Academic Motivation 


Perfectionism and Academic Self-Efficacy 


Among many motivational constructs, one plausible mediator 
between perfectionism and learning outcomes is academic self- 
efficacy. Self-efficacy represents subjective convictions for suc- 
cessfully carrying out courses of action to achieve desired out- 
comes (Bandura, 1977). Beliefs of self-efficacy are tailored to 
particular tasks, activities, or domains of functioning. Academic 
self-efficacy, therefore, refers to learners’ subjective convictions 
for successfully performing specific academic tasks at designated 
levels (Schunk, 1991). The central role academic self-efficacy 
plays in determining the strength of motivation and quality of 
achievement-related outcomes in so many different settings and 
subject areas (Multon, Brown, & Lent, 1991; Pajares, 1996) 
strongly suggests that self-efficacy functions as a mediator be- 
tween stable personality characteristics such as perfectionism and 
outcomes in specific learning contexts. 

Perfectionism and academic self-efficacy would most likely be 
intertwined with each other through the psychological mechanisms 
of goal-setting and self-evaluation. Self-oriented perfectionism, by 
definition, involves setting high goals and striving to attain them 
(Hewitt & Flett, 1991). Bandura (1997) asserted that acts of setting 
and pursuing challenging personal goals and aspirations foster 
development of self-efficacy, a claim that has received strong 
empirical support from both self-efficacy and goal-setting litera- 
tures. As individuals pursue higher goals, their self-efficacy and 
performance improve correspondingly (Locke & Latham, 2002). 
Academic self-efficacy mediates the connection between goals and 
eventual performance as students work toward their goals, moni- 
toring their progress and developing necessary skills (Schunk, 
1996). Self-efficacy of learners is best promoted when they set 
challenging goals and engage in frequent self-evaluations of their 
goal progress (Schunk & Ertmer, 1999). 

Self-oriented perfectionists, who are in pursuit of difficult self- 
set goals, would be vigilant about assessing their performance 
against these goals because goals also serve as standards with 
which to evaluate performance (Locke & Latham, 2002). Accom- 
plishment of proximal subgoals while striving to achieve the 
difficult final goal provides these perfectionistic learners with 
mastery experiences, which constitute the most potent source of 
self-efficacy information (Bandura, 1977). The end product is a 
stronger sense of self-efficacy, accompanied by intrinsic interest, 
self-satisfaction, and enhanced performance (Bandura & Schunk, 
1981). 


Trying to satisfy difficult standards is a hallmark of not only 
self-oriented perfectionism but also socially prescribed perfection- 
ism. Yet socially prescribed perfectionists do not necessarily enjoy 
the profit of goal pursuit in the form of improved self-efficacy. 
According to Latham and Locke (1991), the effects of goal-setting 
are moderated by the degree of goal commitment. For individuals 
who are high in goal commitment, performance improves linearly 
with goal difficulty, presumably with the help of augmented per- 
cepts of self-efficacy. For those who are low in goal commitment, 
however, performance shows no systematic relationship with goal 
difficulty. Goal commitment is higher when the goals are attain- 
able and individuals participate in setting them, compared to when 
they are impossible to attain and assigned by others. Socially 
prescribed perfectionists strive to fulfill excessively high standards 
that are imposed by others. Goal commitment of socially pre- 
scribed perfectionists, therefore, would not be as strong as that of 
self-oriented perfectionists and the weaker goal commitment com- 
promises the self-efficacy benefit they should otherwise reap from 
their enactive mastery experiences. 

Supporting these conjectures, Mills and Blankstein (2000) ob- 
served that self-oriented perfectionism demonstrated a positive 
correlation with academic self-efficacy for learning and perfor- 
mance in an introductory psychology course. Socially prescribed 
perfectionism exhibited a nonsignificant correlation with academic 
self-efficacy, which became significant and negative when only 
the unique variance was considered. Similarly, Van Yperen (2006) 
reported that a subdimension of self-oriented perfectionism, the 
importance of being perfect, correlated positively with self- 
efficacy, while a subdimension of socially prescribed perfection- 
ism, others’ high standards, correlated negatively with it. 

Evidence of mediation by self-efficacy was also observed in a 
study by Dunkley, Zuroff, and Blankstein (2003). The researchers 
hypothesized that self-blame, lower self-efficacy, and perceived 
criticism from others would mediate the relationship between 
self-critical perfectionism and avoidant coping. Personal standards 
perfectionism, an adaptive form of perfectionism that included 
self-oriented perfectionism, was distinguished from self-critical 
perfectionism, a maladaptive form of perfectionism that included 
socially prescribed perfectionism. Only self-critical perfectionism 
displayed a significant negative correlation with self-efficacy. Sup- 
porting the authors’ hypothesis, higher self-critical perfectionism 
predicted lower self-efficacy, which in turn predicted greater 
avoidant coping in the form of denial and disengagement from 
stressful events. As the latter two studies assessed self-efficacy for 
life events, a direct test of academic self-efficacy as a mediator 
between perfectionism and learning outcomes is required. 


Perfectionism and Achievement Goals 


Achievement goals have received even greater attention than 
academic self-efficacy has as potential mediators of perfectionism- 
outcome relationships in the academic domain. Achievement goals 
represent underlying purposes of achievement-related behavior in 
specific achievement situations (Dweck & Leggett, 1988). Al- 
though disagreement exists on the exact definition and functions of 
each achievement goal, a mastery goal has emerged as a positive 
predictor of interest, while a performance-approach goal of pur- 
suing normative competence has emerged as a positive predictor of 
performance. A performance-avoidance goal of avoiding normative 
incompetence has been a consistent negative predictor of both 
outcomes (Hulleman, Schrager, Bodmann, & Harackiewicz, 
2010). 

A number of parallels between the literatures on perfectionism 
and achievement goals delineate how these constructs may be 
relevant to each other. Self-oriented perfectionism arises from a 
motive to achieve, while socially prescribed perfectionism results 
from fear of failure (Speirs Neumeister, 2004). An achievement 
motive is also an antecedent of mastery and performance-approach 
goals, while fear of failure is an antecedent of performance- 
approach and performance-avoidance goals (Elliot & Church, 
1997). Sharing the same motive, self-oriented perfectionists would 
more likely pursue the two approach-oriented goals, whereas so- 
cially prescribed perfectionists would more likely pursue the two 
performance-oriented goals. 

Speirs Neumeister (2004) interviewed gifted college students 
identified as either self-oriented or socially prescribed perfection- 
ists and found support for the hypothesized links. Self-oriented 
perfectionists were driven by a strong achievement motive and 
adopted either a mastery goal of learning new things and improv- 
ing oneself or a performance-approach goal of doing better than 
others. These students sought out challenging academic tasks and 
prepared far in advance for assignments and exams. Socially 
prescribed perfectionists, in comparison, were driven by a strong 
fear of failure and adopted either a performance-approach goal of 
validating one’s ability or a performance-avoidance goal of avoid- 
ing doing worse than others. These students procrastinated to 
exonerate themselves from the implications of potential failure. 

The two perfectionism dimensions exhibited different associa- 
tions with measures of self-criticism as well (Trumpeter, Watson, 
& O’Leary, 2006). On the one hand, both perfectionism dimen- 
sions correlated positively with internalized self-criticism, a neg- 
ative evaluation of the self due to failure to meet self-set standards. 
On the other hand, only socially prescribed perfectionism corre- 
lated positively with comparative self-criticism, a negative evalu- 
ation of the self due to failure to perform as well as others. These 
findings suggest that the nature of competence evaluation associ- 
ated with each perfectionism dimension might provide another 
mechanism through which perfectionism promotes particular 
achievement goals. Self-oriented perfectionists, who evaluate their 
performance solely against self-set standards, would likely be 
drawn to a mastery goal, which defines competence in an absolute 
sense in reference to goals and standards (Elliot & McGregor, 
2001; Pintrich, 2000). Socially prescribed perfectionists, who eval- 
uate their performance against that of others, would find 
performance-approach and performance-avoidance goals more at- 
tractive because the normative definition of competence (Elliot & 
McGregor, 2001; Pintrich, 2000) aligns well with the way socially 
prescribed perfectionists evaluate competence. 

Verner-Filion and Gaudreau (2010) not only replicated the 
proposed links between perfectionism and achievement goals 
for college students but also presented evidence of mediation by 
achievement goals. They assessed perfectionism and achieve- 
ment goals before midterm exams and academic satisfaction 
and grade point averages after midterm exams. Self-oriented 
perfectionism positively linked to mastery, performance-approach, 
and performance-avoidance goals. It also positively predicted ac- 
ademic satisfaction and grade point averages. Socially prescribed 
perfectionism negatively linked to mastery goals and positively 
linked to performance-approach and performance-avoidance goals. 
It negatively predicted academic satisfaction and grade point averages. 
The paths between perfectionism and academic satisfaction 
were mediated by a mastery goal and those between perfectionism 
and grade point averages were mediated by a performance- 
approach goal. 

More specifically, as students’ self-oriented perfectionism be- 
came stronger, they were better positioned to experience improved 
academic satisfaction and higher grade point averages. Adoption 
of a mastery goal provided one channel through which self- 
oriented perfectionism resulted in increased academic satisfaction, 
while adoption of a performance-approach goal resulted in higher 
academic achievement for self-oriented perfectionism. Quite the 
contrary, as students’ socially prescribed perfectionism became 
stronger, they experienced decreased academic satisfaction and 
lower academic achievement. Socially prescribed perfectionism 
made pursuit of a mastery goal less likely, which partly explained 
why it was associated with reduced academic satisfaction. How- 
ever, socially prescribed perfectionism also meant a greater like- 
lihood of adopting a performance-approach goal, which predicted 
higher, not lower, subsequent grade point averages. These results 
illustrate complex routes by which different types of perfectionism 
connect to achievement-related outcomes and highlight the bene- 
fits of incorporating motivational variables such as achievement 
goals in the relationships between perfectionism and learning 
outcomes. 


Self-Efficacy and Achievement Goals as Predictors of 
Achievement-Related Outcomes 


In our review of the literature presented earlier, we focused on 
test anxiety, academic procrastination, cheating, and academic 
performance among diverse outcomes that perfectionism predicts 
in learning situations, due to their direct implications for students’ 
achievement striving. These achievement-related outcomes are 
also the ones with which self-oriented and socially prescribed perfectionism 
have demonstrated contrasting associations in past research. 
Evidence diverges on several of these relationships, however, 
which would benefit from additional empirical tests. The relation- 
ships that self-efficacy and each of the achievement goals display 
with the same outcomes, in comparison, have been far more 
consistent. 

According to the extant literature, academic self-efficacy is a 
negative predictor of test anxiety (Bandalos, Finney, & Geske, 
2003) and academic procrastination (Steel, 2007; Wolters, 2003, 
2004) and a positive predictor of achievement (Bong, 2005; Wolt- 
ers, 2003, 2004). A mastery goal is a negative predictor of test 
anxiety (Bandalos et al., 2003), acceptability of cheating (Mur- 
dock, Miller, & Kohlhardt, 2004), and academic procrastination 
(Wolters, 2004), while a performance-approach goal is a positive 
predictor of acceptability of cheating (Murdock et al., 2004) and 
achievement (Daniels et al., 2009; Hulleman et al., 2010; Wolters, 
2004). In Murdock et al. (2004), performance-approach and 
performance-avoidance goals formed a single factor. It is possible, 
therefore, that the avoidance component was primarily responsible 
for the positive path from the performance goal to acceptability of 
cheating in their study. However, Anderman, Griesinger, and 
Westerfield (1998) showed that an extrinsic goal, a variant of a 
performance-approach goal, was also a significant positive predictor 
of acceptability of cheating. 

The link between a performance-approach goal and anxiety is 
mixed, with early studies that did not distinguish between ap- 
proach and avoidance components reporting a positive path (Ban- 
dalos et al., 2003; Daniels et al., 2009) and later studies reporting 
a nonsignificant path (Sideridis, 2005). A performance-avoidance 
goal is a positive predictor of anxiety (Pekrun, Elliot, & Maier, 
2006; Sideridis, 2005), acceptability of cheating (Bong, 2008; 
Murdock et al., 2004), and academic procrastination (Wolters, 
2004) and a negative predictor of achievement (Bong, 2005; Hul- 
leman et al., 2010; Sideridis, 2005). 

Regarding the relationships among the motivational constructs, 
opinions in the literature differ on the causal precedence 
between academic self-efficacy and achievement goals. 
Dweck and Leggett (1988) viewed perceived competence mainly 
as a moderator of the achievement goal effects. Others treat self- 
efficacy as an outcome of achievement goals (e.g., Middleton & 
Midgley, 1997). Self-efficacy theorists maintain that a core com- 
ponent of self-efficacy is perceived competence (Bong & Skaalvik, 
2003; Schunk & Pajares, 2005), which achievement goal theorists 
recognize as an antecedent of all achievement goals (e.g., Elliot & 
Church, 1997). From a theoretical standpoint, then, it is most 
plausible to regard self-efficacy as causally predominant to achieve- 
ment goals. Supporting this conjecture, changes in the academic 
self-efficacy of Korean high school students predicted changes in their 
subsequent achievement goals but not the other way around (Bong, 
2005). Further, self-efficacy in the preceding semester was a signifi- 
cant predictor of the mastery and performance-avoidance goals in the 
following semester but not of the performance-approach goal. Based 
on these findings, we expected academic self-efficacy to precede 
achievement goals in our model. 


Present Hypotheses 


We tried to address two primary research questions in this study: 
(a) Is self-oriented perfectionism adaptive, and socially prescribed 
perfectionism maladaptive, in the academic domain? and (b) Do 
achievement goals and academic self-efficacy mediate the rela- 
tionships between perfectionism and achievement-related out- 
comes? Assuming that the answer to Question b is yes, we were 
also interested in uncovering the nature of the mediation by each 
motivational construct in the perfectionism-outcome associations. 

To answer Question a, which would help ascertain the dimen- 
sional characteristics of perfectionism, we tested the direct rela- 
tionships between perfectionism and outcome variables. The left 
panel of Figure 1 presents the hypothesized paths. The paths that 
have consistent support in the literature or were expected on 
theoretical grounds are indicated with solid lines, whereas those 
that lack consistent support, with the possibility of a null relation- 
ship, are indicated with dotted lines. To answer Question b, which 
would substantiate the hypothesized mediation of the perfection- 
ism effects by motivational constructs and clarify the nature of 
such mediation, we tested the indirect relationships between per- 
fectionism and outcome variables via academic self-efficacy and 
achievement goals. The right panel of Figure 1 presents the hy- 
pothesized mediation. Again, consistency of theoretical and em- 
pirical support in the literature for each hypothesized path is 
indicated by solid and dotted lines. 

Our main interest in this research was in the mediation model. 
If perfectionism maintains its direct links to the various achievement- 
related outcomes in the model, it may mean that the effects of 
stable personality characteristics on academic outcomes are too 
strong to be mediated by domain-specific motivation. Although 
such results would be inconsistent with the contemporary literature 
on motivation (e.g., Elliot & Church, 1997), perfectionism might 
just be one exception to this trend. Conversely, if domain-specific 
motivational constructs successfully mediate the paths between 
perfectionism and outcomes, this would once again highlight the 
functional centrality of motivational beliefs in determining the 
learning outcomes in specific achievement situations (e.g., Pajares, 
1996). 





Figure 1. Hypothesized positive (+) and negative (−) paths and mediation by academic self-efficacy and 
achievement goals. Dotted paths indicate a possibility of nonsignificant relationships. Shaded boxes indicate 
variables assessed in reference to math and English. SOP = self-oriented perfectionism; SPP = socially 
prescribed perfectionism; SE = academic self-efficacy; MAP = mastery goal; PAP = performance-approach 
goal; PAV = performance-avoidance goal; CHT = acceptability of cheating; ANX = test anxiety; PROC = 
academic procrastination; ACH = achievement scores. 

By testing complex interrelationships between perfectionism, 
motivation, and major achievement-related outcomes, we were 
also hoping that the results might shed light on the nature of not 
only perfectionism but also a performance-approach goal. Cur- 
rently, a performance-approach goal is associated with mixed 
effects but the reasons behind its positive and negative effects have 
not been clearly understood (Senko & Harackiewicz, 2005). The 
way in which each perfectionism dimension predisposes students 
to adopt a particular achievement goal in academic settings 
could allow us to generate inferences regarding one such mecha- 
nism. 


Method 


Participants and Procedures 


Data were collected from 306 seventh graders attending a public 
middle school in a metropolitan city near Seoul, Korea. This 
school serves middle-income families and is large in scale with 10 
to 12 classes at each grade level. Ages of the participants ranged 
from 12 years and 5 months to 13 years and 4 months at the time 
of the survey. Education in 6 years of elementary school and 3 
years of middle school is compulsory in Korea. The seventh grade 
marks the first year after the transition to middle school. 

Middle school students take a nationwide, standardized compe- 
tency test at the end of their senior year, which they must pass to 
advance to academic-track high schools. Scores on this test also 
determine their eligibility to enter select high schools. Flett et al. 
(1994) showed that relationships of self-oriented and socially 
prescribed perfectionism with other variables change depending on 
the degree of perceived evaluative threat. Because we were inter- 
ested in the function of perfectionism under normal academic 
circumstances, seventh graders who were yet to experience ele- 
vated test stress were deemed an appropriate target for this re- 
search. 

Surveys were administered during regular classroom hours, 
several days before final exams. We assured students of confiden- 
tiality of their responses. Data from 304 students (148 girls, 156 
boys) were analyzed, excluding two students with too many missing 
responses. 


Measures 


Students responded to items on a 5-point Likert-type scale with 
1 indicating strongly disagree and 5 strongly agree for all but the 
academic procrastination scale, which had 1 indicating never procrastinate 
and 5 always procrastinate. All scales had been translated 
and validated in Korean in previous research (see below). 
The Cronbach’s αs obtained in the present study are reported in 
Table 1. 

Academic self-efficacy and achievement goals were assessed in 
reference to the specific academic subjects of math and English 
because, (a) whereas academic motivation is generally domain- 
specific, academic self-efficacy (Bong, 1997; Pajares, 1996) and 
achievement goals (Bong, 2001) contain particularly strong 
domain-specific components; (b) motivation in math and that in 
English are distinct from each other (Bong, 1997; Marsh, Byrne, & 
Shavelson, 1988); and (c) testing the hypothesized mediation 
across two discrete subject matter areas would help ascertain 
generalizability of the hypothesized mediation. 

Self-oriented and socially prescribed perfectionism. We 
used the Multidimensional Perfectionism Scale (MPS) by Hewitt 
and Flett (1991). The scale contains 15 items assessing self-oriented 
perfectionism (e.g., “I demand nothing less than perfection 
of myself”) and another 15 assessing socially prescribed 
perfectionism (e.g., “The people around me expect me to succeed 
at everything I do”). Both scales had demonstrated satisfactory 
internal consistency with αs above .85 in past research (Flett et al., 
1995; Miquelon et al., 2005; Stoeber et al., 2009). The translated 
versions had functioned well among Korean college students as 
well (Seo & Synn, 2006; αs = .88 for self-oriented perfectionism 
and .77 for socially prescribed perfectionism). 

Academic self-efficacy. We used five items in the self- 
efficacy subscale of the Motivated Strategies for Learning Ques- 
tionnaire (MSLQ; Pintrich & De Groot, 1990) to assess academic 
self-efficacy (e.g., “I’m certain I can understand the ideas taught in 
my subject class”). The translated version had demonstrated αs 
above .88 in various samples of Korean middle and high school 
students and correlated significantly with achievement goals, test 
anxiety, strategy use, perceived classroom goal structures, and 
academic achievement (Bong, 2005, 2008, 2009). 

Table 1 

Variable                               M      SD     α     Range   Min. observed  Max. observed  Skewness  Kurtosis 
Self-oriented perfectionism            3.44   0.60   .82   1-5     1.67           4.93           0.05      −0.27 
Socially prescribed perfectionism      3.25   0.64   [?]   1-5     [?]            4.90           0.07      [?] 
Test anxiety                           3.59   1.03   [?]   1-5     1.00           5.00           −0.40     −0.58 
Acceptability of cheating              1.81   0.89   .67   1-5     1.00           5.00           1.09      0.67 
Academic procrastination               2.95   0.78   .63   1-5     1.00           5.00           0.20      −0.10 
Academic self-efficacy in math         3.12   [?]    .94   1-5     1.00           5.00           0.01      −1.00 
Academic self-efficacy in English      3.45   1.10   [?]   1-5     1.00           5.00           −0.40     −0.70 
Mastery goal in math                   4.00   0.95   .80   1-5     1.00           5.00           [?]       1.06 
Mastery goal in English                4.02   0.85   [?]   1-5     1.00           5.00           −0.94     1.02 
Performance-approach goal in math      3.25   [?]    .90   1-5     1.00           5.00           [?]       −0.81 
Performance-approach goal in English   3.23   [?]    .86   1-5     1.00           5.00           −0.28     −0.60 
Performance-avoidance goal in math     2.69   1.18   .80   1-5     1.00           5.00           0.12      −0.94 
Performance-avoidance goal in English  [?]    1.10   [?]   1-5     1.00           5.00           0.23      −0.68 
Achievement score in math              57.44  [?]          0-100   3.80           100.00         −0.10     [?] 
Achievement score in English           69.83  24.48        0-100   12.00          100.00         −0.74     −0.99 

Note. Min. = minimum; Max. = maximum; [?] = illegible. N = 304. 

Achievement goals. Nine items were adopted from the 
achievement goal scale used in Elliot and McGregor (2001), which 
has three items each for mastery (e.g., “I want to learn as much as 
possible from this subject class”), performance-approach (e.g., “It 
is important for me to do better than other students in this subject 
class”), and performance-avoidance goals (e.g., “I just want to 
avoid doing poorly in this subject class”). The achievement goal 
scale had displayed αs between .77 and .91 for the mastery goal, 
.90 and .94 for the performance-approach goal, and .73 and .89 for 
the performance-avoidance goal in their study. Scores on the scales 
had correlated significantly with motive dispositions, implicit the- 
ories of ability, perceived competence, anxiety, study strategies, 
and academic performance in past research (Cury, Elliot, Da 
Fonseca, & Moller, 2006; Elliot & McGregor, 2001; Pekrun et al., 
2006). The translated scales had shown αs ranging from .74 to .84, 
.61 to .92, and .65 to .78 for the mastery, performance-approach, 
and performance-avoidance goals, respectively, among Korean 
adolescents in various school subjects (Bong, 2005, 2008, 2009). 

Test anxiety. We used the six-item test anxiety subscale of the 
MSLQ (Duncan & McKeachie, 2005; e.g., “When I take a test, I 
think about how poorly I am doing compared with other students”). 
The scale had proven internally consistent with α = .80 
among college students (Mills & Blankstein, 2000; Pintrich, 
Smith, Garcia, & McKeachie, 1993). In previous research with 
sixth graders in the United States (Middleton & Midgley, 1997), 
the internal consistency estimate of this scale dropped to .68. The 
translated version had demonstrated a similar degree of internal 
consistency among Korean middle school students with α = .63 
(Bong, 2009). 

Acceptability of cheating. Three items in a scale used by 
Anderman et al. (1998) were modified to investigate cheating on 
tests (e.g., “Is it okay to cheat on tests?”) because cheating is most 
frequently discussed in test-taking contexts in Korea. The scale 
had displayed acceptable internal consistency with αs above .64 in 
U.S. middle and high school samples (Anderman et al., 1998; 
Murdock et al., 2004). The translated version had shown similar 
internal consistency estimates with α = .64 among Korean middle 
school students in past research (Hwang, 2010). 

Academic procrastination. We administered the Procrastination 
Assessment Scale–Student (PASS; Solomon & Rothblum, 
1984) that assesses the frequency (e.g., “Studying for exams: To 
what degree do you procrastinate on this task?”) and perceived 
severity of academic procrastination (e.g., “To what degree is 
procrastination on this task a problem for you?”). We used only the 
frequency items in this study because the severity items behaved 
differently (As = .20). The scale had exhibited internal consistency 
estimates above .75 and correlated significantly with perfection- 
ism, anxiety, and grade point averages in past research (Fritzsche, 
Young, & Hickson, 2003; Howell, Watson, Powell, & Buro, 2006; 
Milgram, Marshevsky, & Sadeh, 1995). The translated version had 
demonstrated α = .87 among Korean college students (Synn, Park, 
& Seo, 2005). We only used tasks that are applicable to middle 
school contexts and modified the descriptions to make them more 
meaningful to middle school students. These tasks included completing 
homework (revised from “writing a term paper”), studying 
for exams, and keeping up weekly class materials (revised from 
“keeping up weekly reading assignments”). 

Academic achievement. Final exam scores in math and Eng- 
lish served as indexes of academic achievement. Scores on these 
exams could range from 0 to 100. 


Overview of Analysis 


Because our sample size was considered small for the number of 
parameters to be estimated in the model, we applied a three-stage 
approach for reducing the participants-to-parameters ratio to an 
acceptable level, so as not to obtain nonconvergent or improper 
solutions (Anderson & Gerbing, 1984). We first performed pre- 
liminary confirmatory factor analyses (CFAs) per construct. For 
evaluating model fit, we used several fit indexes in addition to the 
chi-square statistics, which are known to be sensitive to sample 
size. We applied the Tucker–Lewis index (TLI) greater than .90 
(Bentler, 1990; Tucker & Lewis, 1973), the comparative fit index 
(CFI) greater than .95 (Hu & Bentler, 1999), and the root-mean- 
square error of approximation (RMSEA) less than .08 (Browne & 
Cudeck, 1993) as cutoff criteria for acceptable model fit. 
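
As a rough illustration of how the cutoff criteria above could be applied, the following Python sketch checks a set of fit indexes against the stated thresholds. The function names and input values are hypothetical and are not taken from the study. The RMSEA point estimate uses the common formula sqrt(max(chi-square − df, 0) / (df × (N − 1))), which may differ slightly from what a particular SEM program reports.

```python
import math


def rmsea(chi_square: float, df: int, n: int) -> float:
    """Point estimate of RMSEA from a model chi-square, its degrees of
    freedom, and the sample size (assumed formula; some programs use N
    rather than N - 1 in the denominator)."""
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


def acceptable_fit(tli: float, cfi: float, rmsea_value: float) -> bool:
    """Apply the cutoffs named in the text: TLI > .90, CFI > .95, RMSEA < .08."""
    return tli > 0.90 and cfi > 0.95 and rmsea_value < 0.08


if __name__ == "__main__":
    # Hypothetical model results, for illustration only.
    estimate = rmsea(chi_square=100.0, df=50, n=304)
    print(round(estimate, 3), acceptable_fit(tli=0.93, cfi=0.96, rmsea_value=estimate))
```

All three criteria must hold at once in this sketch; a study could reasonably weigh the indexes differently.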

Using the factor loadings, factor variances, and error variances 
and covariances from these models, we computed factor rho co- 
efficients as reliability estimates (Raykov, 1997, 2004). We then 
created a reliability-driven composite score for each latent variable 
(Bentler, 2009) by fixing the error variance with a formula, (1 − 
scale reliability) × scale variance (Hayduk, 1987). Using these 
composite scores, corrected for unreliability, in subsequent anal- 
yses substantially reduced the number of parameters to be esti- 
mated to a level appropriate to our sample size. 
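
The two computations described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual procedure: the reliability function uses the standard composite reliability formula for a single-factor model (squared sum of loadings times the factor variance, divided by that quantity plus the summed error variances and twice the summed error covariances), and all names and numbers are hypothetical.

```python
def composite_reliability(loadings, error_variances,
                          factor_variance=1.0, error_covariances=()):
    """Composite (factor rho) reliability for a single-factor model.

    Assumed formula: true-score variance / (true-score variance + error
    variance), where error covariances count twice.
    """
    true_var = (sum(loadings) ** 2) * factor_variance
    error_var = sum(error_variances) + 2.0 * sum(error_covariances)
    return true_var / (true_var + error_var)


def fixed_error_variance(reliability, scale_variance):
    """Error variance fixed for a composite: (1 - reliability) * variance."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return (1.0 - reliability) * scale_variance


if __name__ == "__main__":
    # Hypothetical loadings and error variances for a three-item scale.
    rho = composite_reliability([0.8, 0.7, 0.6], [0.36, 0.51, 0.64])
    # Hypothetical composite variance of 0.36 (i.e., SD = 0.60).
    print(round(rho, 3), round(fixed_error_variance(rho, 0.36), 4))
```

Fixing the error variance this way lets each construct enter the path model as a single indicator, which is what reduces the number of parameters to be estimated.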

To answer Questions a and b presented above, we first ran a 
direct path model between perfectionism and outcomes. When the 
direct paths from perfectionism to outcome variables were signif- 
icant, we proceeded to test a mediating model in which the paths 
from perfectionism to outcome variables were mediated by aca- 
demic self-efficacy and achievement goals, as illustrated in Figure 1. 
Because academic self-efficacy and achievement goals were as- 
sessed separately in math and English, we tested two models, one 
with math-specific variables and the other with English-specific 
ones. The statistical significance of total indirect effects, involving
all mediation paths linking one variable to the other, was tested with
a bootstrapping method using 1,000 bootstrap samples and
95% bias-corrected confidence intervals. When the total indirect
effects proved significant, Sobel tests followed to examine the 
statistical significance of individual indirect paths involved (Kline, 
2005, p. 162). All measurement models and path analyses were run 
with AMOS 7.0 (Arbuckle, 2006). 
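
The two tests named above can be sketched for the simplest case of one predictor X, one mediator M, and one outcome Y. This is a minimal illustration under stated assumptions, not the AMOS procedure: the Sobel statistic uses the usual first-order standard error, the bootstrap interval uses the standard bias-correction of percentile limits, and all function names and simulated data are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()


def sobel_z(a, se_a, b, se_b):
    """Sobel z for the indirect effect a*b (first-order standard error)."""
    return (a * b) / math.sqrt(b**2 * se_a**2 + a**2 * se_b**2)


def _slope(x, y):
    """Ordinary least squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx


def indirect_effect(x, m, y):
    """a*b, with a = slope of M on X and b = slope of Y on M controlling for X."""
    a = _slope(x, m)
    c = _slope(x, y)
    # Remove the linear effect of X from M and Y, then regress the remainders.
    m_res = [mi - a * xi for xi, mi in zip(x, m)]
    y_res = [yi - c * xi for xi, yi in zip(x, y)]
    return a * _slope(m_res, y_res)


def bc_bootstrap_ci(x, m, y, n_boot=1000, level=0.95, seed=0):
    """Bias-corrected bootstrap confidence interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimate = indirect_effect(x, m, y)
    boots = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boots.append(indirect_effect([x[i] for i in idx],
                                     [m[i] for i in idx],
                                     [y[i] for i in idx]))
    boots.sort()
    # Bias correction: shift the percentile limits by z0, which reflects how
    # far the sample estimate sits from the median of the bootstrap estimates.
    prop_below = sum(bv < estimate for bv in boots) / n_boot
    prop_below = min(max(prop_below, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = STD_NORMAL.inv_cdf(prop_below)
    z_crit = STD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    lo_p = STD_NORMAL.cdf(2 * z0 - z_crit)
    hi_p = STD_NORMAL.cdf(2 * z0 + z_crit)
    lo = boots[min(int(lo_p * n_boot), n_boot - 1)]
    hi = boots[min(int(hi_p * n_boot), n_boot - 1)]
    return estimate, lo, hi


# Simulated data with a true indirect effect of 0.5 * 0.4 (illustration only).
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(100)]
m = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [0.4 * mi + rng.gauss(0, 1) for mi in m]
print(sobel_z(0.5, 0.1, 0.4, 0.1))
print(bc_bootstrap_ci(x, m, y))
```

An indirect effect is typically judged significant when the bootstrap interval excludes zero, or when the absolute Sobel z exceeds the critical value for the chosen alpha level.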


Results 


Descriptive Statistics 


Responses to negatively worded items were reverse-coded so 
that high scores represent greater possession of the construct under 
investigation. Skewness and kurtosis statistics indicated that re- 
sponses to all items approximate normal distributions. Frequency 
of missing responses per item ranged between 0 and 4, with 
missing rates less than 1.3% across all items. Missing values were 
imputed with series means. Table 1 reports descriptive statistics of
the scales. 
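
The two preprocessing steps described in this paragraph can be sketched as follows. The sketch assumes the 1-5 response scale reported for these measures and treats a "series mean" as the mean of the observed responses to the same item; the function names and sample responses are hypothetical.

```python
from statistics import mean


def reverse_code(score, scale_min=1, scale_max=5):
    """Reverse a response so that high scores indicate more of the construct."""
    return scale_min + scale_max - score


def impute_series_mean(responses):
    """Replace missing responses (None) with the mean of the observed ones."""
    observed = [r for r in responses if r is not None]
    item_mean = mean(observed)
    return [item_mean if r is None else r for r in responses]


# Hypothetical responses of five students to one negatively worded item.
item = [4, None, 2, 5, 3]
reversed_item = [None if r is None else reverse_code(r) for r in item]
print(impute_series_mean(reversed_item))  # [2, 2.5, 4, 1, 3]
```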

Mean scores of most scales ranged between 3 and 4 on a 1-5 
response scale with no strong hint of floor or ceiling effects. The 
acceptability of cheating scale was an exception to this trend with 
M = 1.81. Given the socially undesirable nature of this variable, it 
is not surprising that students provided low agreement ratings on 
these items. Responses to all scales showed acceptable degrees of 
internal consistency as presented in Table 1, except for the accept- 
ability of cheating and academic procrastination scales, which 
were associated with somewhat low αs. We believe the small
number of items (n = 3) on these scales was responsible for the 
low reliability. 
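
The link between scale length and internal consistency can be illustrated with the standardized form of coefficient alpha, which depends only on the number of items and their mean inter-item correlation. The mean correlation of .30 used below is a hypothetical value, not an estimate from these data.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized coefficient alpha for n_items with a given mean correlation."""
    return n_items * mean_inter_item_r / (1 + (n_items - 1) * mean_inter_item_r)


# Same hypothetical mean inter-item correlation, increasing scale length.
for k in (3, 6, 9):
    print(k, round(standardized_alpha(k, 0.30), 2))
```

With identical inter-item correlations, a three-item scale yields a visibly lower alpha than a six- or nine-item scale, which is consistent with the explanation offered above.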


Measurement Models 


Measurement models per construct. We performed CFA for 
each construct with individual items as indicators at this stage. 
When the number of indicators rendered the model just-identified 
and hence not testable by itself, we combined theoretically related 
constructs (e.g., the three achievement goals) in a single model to 
gain degrees of freedom. Error covariances were allowed between 
items for the same construct when both of the following conditions 
were met: (a) The content or wording of the respective items 
justified the covariance, and (b) the modification indexes sug- 
gested not only statistically significant but also substantial im- 
provement in model fit. Error covariances were added to the model 
one at a time. Complete results from these preliminary CFAs are 
available from the first author upon request. 
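
The identification issue mentioned above follows from counting. A minimal sketch, assuming a one-factor model fitted to the covariance matrix with one loading fixed for scaling (so that each indicator contributes one free loading-or-variance parameter plus one error variance); the function name is hypothetical.

```python
def one_factor_cfa_df(n_indicators, n_error_covariances=0):
    """Degrees of freedom of a one-factor CFA fitted to a covariance matrix."""
    observed_moments = n_indicators * (n_indicators + 1) // 2
    # (p - 1) free loadings + 1 factor variance + p error variances = 2p.
    free_parameters = 2 * n_indicators + n_error_covariances
    return observed_moments - free_parameters


for p in (3, 4, 5):
    print(p, one_factor_cfa_df(p))
```

A three-indicator factor has zero degrees of freedom and so cannot be tested on its own, whereas combining related constructs in one model, as described above, adds observed covariances faster than it adds parameters.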

The model for self-oriented perfectionism demonstrated acceptable
fit to the empirical data, χ²(83, N = 304) = 129.863, p < .01
(TLI = .939, CFI = .952, RMSEA = .043). The socially pre- 
scribed perfectionism scale had five items with low factor loadings 
(λs = .20) that failed to reach statistical significance (ps > .05).
Four of them were reverse-coded items (e.g., “Those around me 
readily accept that I can make mistakes too”) and one was phrased 
in a negative way (i.e., “I find it difficult to meet others’ expectations
of me”). These results indicate that the middle school
respondents found these items unclear and different from the rest 
of the items. We thus excluded these items from further analyses. 
The final model fit the data well, χ²(30, N = 304) = 47.963, p <
.05 (TLI = .951, CFI = .967, RMSEA = .044).


The model for test anxiety demonstrated acceptable fit, although
the RMSEA value was slightly over the cutoff criterion, χ²(4, N =
304) = 12.691, p < .05 (TLI = .949, CFI = .980, RMSEA =
.085). The model with acceptability of cheating and academic
procrastination also displayed satisfactory fit indexes, χ²(8, N =
304) = 22.148, p < .01 (TLI = .914, CFI = .954, RMSEA =
.076). The model for academic self-efficacy in math fit the data
reasonably well, again with the RMSEA value slightly over the
cutoff criterion, χ²(4, N = 304) = 12.586, p < .05 (TLI = .958,
CFI = .979, RMSEA = .084). The model for academic
self-efficacy in English demonstrated excellent fit, χ²(5, N = 304) =
8.785, p = .118 (TLI = .993, CFI = .997, RMSEA = .050). The
achievement goal model in both math, χ²(24, N = 304) = 64.372,
p < .001 (TLI = .956, CFI = .971, RMSEA = .075), and English,
χ²(24, N = 304) = 47.517, p < .01 (TLI = .967, CFI = .975,
RMSEA = .057), produced satisfactory fit indexes.
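
The RMSEA values reported above can be related to the chi-square statistics with the usual point-estimate formula. The sketch below assumes the N − 1 convention in the denominator (some programs use N instead); the function name is hypothetical, and the inputs are three of the chi-square results reported in this section.

```python
import math


def rmsea(chi_square, df, n):
    """RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (N - 1)))."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


reported = [
    ("test anxiety", 12.691, 4),
    ("academic self-efficacy, English", 8.785, 5),
    ("achievement goals, math", 64.372, 24),
]
for label, chi2, df in reported:
    print(label, round(rmsea(chi2, df, 304), 3))
```

Under this convention the computed values agree with the reported RMSEAs of .085, .050, and .075, respectively.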

Full measurement models. Next, we tested measurement 
models in math and English with all variables. Fit indexes are not 
computed, as these models are just-identified with only a single 
indicator for each latent variable. Table 2 presents correlation 
coefficients among the variables. In previous research, the adap- 
tive and maladaptive characteristics of perfectionism and any 
mediating process were best demonstrated in either partial corre- 
lations or path analysis. Therefore, we only describe some of the 
notable findings here. 

Consistent with prior reports, self-oriented perfectionism and 
socially prescribed perfectionism correlated positively with each 
other (r = .56). Both types of perfectionism also correlated posi- 
tively with test anxiety, but the correlation was stronger with 
socially prescribed perfectionism (r = .45) than with self-oriented 
perfectionism (r = .25). Neither type of perfectionism correlated 
with acceptability of cheating. Whereas self-oriented perfectionism
correlated negatively with academic procrastination (r =
−.37) and positively with achievement in both math (r = .24) and
English (r = .22), socially prescribed perfectionism did not cor- 
relate with academic procrastination and only correlated positively 
with achievement in math (r = .15). 

Table 2
Correlation Coefficients Among Variables

Variables: 1. Self-oriented perfectionism; 2. Socially prescribed perfectionism; 3. Test anxiety; 4. Acceptability of cheating; 5. Academic procrastination; 6. Academic self-efficacy; 7. Mastery goals; 8. Performance-approach goals; 9. Performance-avoidance goals; 10. Academic achievement.
[Coefficient matrix not recoverable from the source; the notable correlations are reported in the text.]

Note. N = 304. Coefficients from the math model are presented above the diagonal; those from the English model below the diagonal. Dashes indicate that coefficients are presented in the upper diagonal.
*p < .05. **p < .01. ***p < .001.

The two perfectionism variables also exhibited different patterns
of correlations with subject-specific motivation variables. The
correlations were largely consistent with the extant literature.
Self-oriented perfectionism correlated positively with academic
self-efficacy in both domains (rs = .38 in math and .35 in English),
while socially prescribed perfectionism did not. Both perfection- 
ism variables demonstrated positive correlations with all three 
achievement goals. Nonetheless, a mastery goal correlated more 
strongly with self-oriented than socially prescribed perfectionism 
(rs = .44 vs. .22 in math and .48 vs. .24 in English) and a 
performance-avoidance goal correlated more strongly with socially
prescribed than self-oriented perfectionism (rs = .39 vs. .24
in math and .49 vs. .29 in English). A performance-approach goal 
exhibited positive correlations with both self-oriented (rs = .50 in 
both math and English) and socially prescribed (rs = .52 in math 
and .60 in English) perfectionism. With few exceptions, the overall 
pattern was highly similar across math and English. 


Path Model With Perfectionism and Outcomes Only 


We performed path analysis with the same set of reliability- 
driven composite scores. Before testing the full path model with 
subject-specific academic self-efficacy and achievement goals as 
mediators, we examined only the direct paths from perfectionism 
to outcome variables. This model fit the data well, χ²(10, N =
304) = 28.376, p < .01 (TLI = .908, CFI = .956, RMSEA = 
.078). Figure 2 presents statistically significant paths at p < .05 
from this model. 

When the two perfectionism variables entered the regression 
equation together, the contrasting characteristics between them 
became clearer. Self-oriented perfectionism did not relate to test 
anxiety but related positively to academic achievement (β = .37),
supporting our hypothesis. Although not anticipated a priori, it also
related negatively to acceptability of cheating (β = −.34) and
academic procrastination (β = −.66). Socially prescribed perfectionism,
in contrast, related positively to test anxiety (β = .45),
acceptability of cheating (β = .25), and academic procrastination
(β = .43). Our hypothesis that socially prescribed perfectionism
would be a positive predictor of maladaptive variables thus re- 
ceived support. Socially prescribed perfectionism did not relate 
significantly to achievement, however. 

Figure 2. Path model with direct paths from perfectionism to outcomes
only. Only statistically significant paths at p < .05 are presented. Error
terms are omitted for clarity. SOP = self-oriented perfectionism; SPP =
socially prescribed perfectionism; ANX = test anxiety; CHT = acceptability
of cheating; PROC = academic procrastination; ACH = achievement
scores.


Path Models With Academic Self-Efficacy and 
Achievement Goals as Mediators 


Next, we tested full path models with subject-specific academic 
self-efficacy and achievement goals as mediators. The model displayed
satisfactory fit to the data in both subject domains, χ²(7,
N = 304) = 13.927, p > .05 (TLI = .931, CFI = .989, RMSEA =
.057) in math, and χ²(7, N = 304) = 10.980, p > .05 (TLI = .959,
CFI = .994, RMSEA = .043) in English. Figure 3 presents 
statistically significant paths at p < .05 from these models, with 
coefficients from the math model to the left of the slash and those 
from the English model to the right of the slash. 

Paths from perfectionism to motivation variables. All of 
our hypotheses regarding the relationships of each perfectionism 
variable with academic self-efficacy and achievement goals re- 
ceived support. Specifically, self-oriented perfectionism related 
positively to academic self-efficacy (βs = .46 in math and .41 in
English), a mastery goal (βs = .34 in math and .35 in English), and
a performance-approach goal (βs = .33 in math and .21 in English)
in the respective subject. Socially prescribed perfectionism
related positively to performance-approach (βs = .34 in math and
.47 in English) and performance-avoidance goals (βs = .32 in
math and .44 in English). The significant bivariate correlation 
between self-oriented perfectionism and a performance-avoidance 
goal and that between socially prescribed perfectionism and a 
mastery goal were no longer observed when only the unique 
variance in perfectionism was considered. 

Two consistent mediation paths between self-oriented perfec- 
tionism and achievement goals by academic self-efficacy emerged. 
One involved a mastery goal (z = 2.87, p < .01 in math and z =
3.18, p < .01 in English) and the other a performance-avoidance
goal (z = −3.36, p < .001 in math and z = −3.27, p < .01 in
English). Self-oriented perfectionism related positively to academic
self-efficacy in both subjects, while academic self-efficacy
in turn related positively to a mastery goal (βs = .24 in math and
.32 in English) and negatively to a performance-avoidance goal in
the domain (βs = −.30 in math and −.32 in English). Table 3
presents estimates of indirect effects from the math model, and 
Table 4 presents those from the English model, along with results 
from the Sobel tests. 

Paths from self-oriented perfectionism to outcome variables. 
Academic self-efficacy and a mastery goal mediated the signifi- 
cant negative path from self-oriented perfectionism to acceptabil- 
ity of cheating. Whereas pursuing a mastery goal alone sufficed as 
a mediator in this relationship, feeling self-efficacious did not. A 
sense of self-efficacy had to be coupled with a mastery goal in the 
subject domain, which then related negatively to acceptability of 
cheating (βs = −.27 in math and −.32 in English). However,
although the individual paths linking self-oriented perfectionism to 
self-efficacy, self-efficacy to a mastery goal, and a mastery goal to 
acceptability of cheating were all statistically significant in both 
subjects, the total indirect effects from self-oriented perfectionism 
to acceptability of cheating were not statistically significant, as 
determined by the bootstrapping method. Only the paths linking 
math self-efficacy to acceptability of cheating via a mastery goal in 
math proved significant (z = −2.17, p < .05).
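
A small worked example may help show why a chain of individually significant paths can still carry a weak indirect effect. Under the standard product-of-coefficients rule, the effect transmitted along a chain equals the product of its standardized path coefficients. The sketch below applies that rule to the three-step chain from self-oriented perfectionism through academic self-efficacy and a mastery goal to acceptability of cheating, using the coefficients reported in the text; it covers only this one chain and is not the total indirect effect tested with the bootstrap.

```python
from math import prod

# Standardized coefficients reported in the text for the chain
# SOP -> self-efficacy -> mastery goal -> acceptability of cheating.
paths = {
    "math": [0.46, 0.24, -0.27],
    "English": [0.41, 0.32, -0.32],
}

indirect = {subject: prod(coefs) for subject, coefs in paths.items()}
for subject, effect in indirect.items():
    print(subject, round(effect, 3))
```

The resulting products are close to −.03 in math and −.04 in English, which is small relative to the individual coefficients in the chain.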

The direct negative path between self-oriented perfectionism
and academic procrastination remained significant
even in the presence of intervening motivational variables. Still,
the coefficients were substantially reduced in magnitude (βs =
−.31 in math and −.38 in English) relative to the coefficient from the
direct path model (β = −.66), suggesting mediation effects. The
path was partially mediated by academic self-efficacy in math
(z = −3.73, p < .001) and by either a mastery goal alone
(z = −2.10, p < .05) or by academic self-efficacy and a mastery
goal together (z = −2.06, p < .05) in English. In math, self-oriented
perfectionism related positively to academic self-efficacy,
which related negatively to academic procrastination (β = −.44).
In English, self-oriented perfectionism related to a mastery goal
either directly (β = .35) or via academic self-efficacy (β = .32).
A mastery goal then related negatively to academic procrastination
(β = −.30).

Figure 3. Final path model. Only statistically significant paths at p < .05
are presented, except for the dotted path with p < .06. Error terms are
omitted for clarity. Path coefficients in math are presented to the left of the
slash; those in English to the right of the slash. SOP = self-oriented
perfectionism; SPP = socially prescribed perfectionism; SE = academic
self-efficacy; MAP = mastery goal; PAP = performance-approach goal;
PAV = performance-avoidance goal; CHT = acceptability of cheating;
ANX = test anxiety; PROC = academic procrastination; ACH = achievement
scores.

The significant positive path from self-oriented perfectionism to 
achievement previously observed in the direct path model was 
fully mediated by subject-specific academic self-efficacy in both 
math (z = 4.74, p < .001) and English (z = 3.78, p < .001). 
Self-oriented perfectionism related positively to academic self- 
efficacy in the subject, which positively predicted achievement in 
both math (β = .62) and English (β = .47).

Paths from socially prescribed perfectionism to outcome 
variables. Whereas the paths from self-oriented perfectionism to 
outcome variables were largely mediated by academic self- 
efficacy and the two approach-oriented achievement goals, those 
from socially prescribed perfectionism were not. Two of the three 
direct paths from socially prescribed perfectionism to outcome 
variables previously observed in the direct path model remained 
significant and in comparable magnitude in the mediation models. 
Specifically, socially prescribed perfectionism was a direct positive
predictor of test anxiety (βs = .32 in math and .21 in English)
and academic procrastination (βs = .31 in math and .36 in English)
in both subject domains. The only significant partial mediation
was between socially prescribed perfectionism and test anxiety
in English (z = 2.26, p < .05). Socially prescribed
perfectionism related positively to a performance-avoidance goal 
in English (β = .44), which related positively to test anxiety (β =
.25).


Discussion 


Students come into class armed with not only motivation and 
prior knowledge but also family background, developmental and 
socialization history, and personality characteristics. It is important 
to learn how these diverse factors all come into play in achieve- 
ment settings, at least what the salient patterns are among major 
variables, to understand the “whole” student. It is for this reason 
that we were interested in the role of perfectionism in academic 
contexts in the first place. The present results once again demonstrated
that the effects of stable personality dispositions such as
perfectionism on academic outcomes, although not trivial, are
mediated by students’ motivational beliefs in specific subject
domains.

Our primary purpose in this research was twofold. First, we tried 
to ascertain the dimensional nature of perfectionism, so as to help 
future research with this personality trait with potentially weighty 
consequences for students in achievement settings. Whereas re- 
searchers seldom question the maladaptive nature of socially pre- 
scribed perfectionism, they have not been able to reach a firm 
conclusion regarding the adaptive nature of self-oriented perfec- 
tionism (Stoeber et al., 2009). We reasoned that dimensionality of 
perfectionism would play out most vividly when assessed in ref- 
erence to typical learning situations, due to the ongoing competi- 
tion, imminent possibilities of failure, and ambiguous definitions 
of success inherent in them. 

Second, we wanted to test once again the importance of aca- 
demic motivation in assisting learners with adaptive as well as 
maladaptive dispositions to adjust and function better in specific 
achievement situations. Because personality traits predispose 
learners to certain motivational tendencies, academic motiva- 
tion likely mediates the processes linking perfectionism to 
achievement-related outcomes (Mills & Blankstein, 2000). Fur- 
ther, because self-oriented and socially prescribed perfectionism 
differ with respect to who initiates and maintains control over 
goals and standards, learners high in each type of perfectionism 
inevitably generate different responses toward identical challenges 
and setbacks and are expected to conclude the same achievement 
episodes differently by following disparate motivational paths. 

The results were largely consistent with our hypotheses. Self- 
oriented perfectionism related positively to academic achievement 
and negatively to acceptability of cheating and academic procras- 
tination in achievement settings. It did not link significantly to test 
anxiety. Socially prescribed perfectionism, in contrast, related 
positively to test anxiety, acceptability of cheating, and academic 
procrastination but did not link significantly to academic achieve- 
ment. Many of the significant paths from perfectionism to out- 
comes were mediated by domain-specific motivation. The paths 
from self-oriented perfectionism to outcomes were mediated 
by academic self-efficacy and a mastery goal in the domain, while 
those from socially prescribed perfectionism were mediated by a 
performance-avoidance goal. Nonetheless, the direct paths from 
the two perfectionism dimensions to academic procrastination and 
that from socially prescribed perfectionism to test anxiety remained
significant, even in the presence of the intervening motivation
variables.

Table 3
Standardized Total, Direct, and Indirect Effects in the Math Model

Note. SOP = self-oriented perfectionism; SPP = socially prescribed perfectionism; SE = academic self-efficacy; MAP = mastery goal; PAV = performance-avoidance goal; PAP = performance-approach goal; ANX = test anxiety; CHT = acceptability of cheating; PROC = academic procrastination; ACH = achievement scores. Based on 1,000 bootstrap samples.


Self-Oriented as Adaptive Perfectionism and Socially 
Prescribed as Maladaptive Perfectionism 


Consistent with previous reports (Flett et al., 1994, 1995; Hewitt 
et al., 2002; Mills & Blankstein, 2000; Verner-Filion & Gaudreau, 
2010), self-oriented perfectionism and socially prescribed perfec- 
tionism correlated with each other yet were clearly distinguishable 
dimensions for our Korean middle school participants. The corre- 
lation between the two perfectionism dimensions, however, was 
noticeably larger than what has been typically observed in the 
literature. The strong correlation between self-oriented and so- 
cially prescribed perfectionism seems to indicate that the Korean 
adolescents, with a strong desire to meet the extremely high 
standards that their teachers and parents set for them, also tended 
to set similarly high standards for themselves and strove to achieve 
those perfectionistic standards. 

Interdependent self-construal could accentuate socially prescribed
perfectionism as well as strengthen the alliance between the
two perfectionism dimensions. The desire to please significant
others in the social network that one identifies with is a strong
source of motivation for individuals in collectivistic cultures with 
interdependent self-construal (Heine, 2001; Markus & Kitayama, 
1991; Oishi & Diener, 2001). Because conformity is a virtue
(Markus & Kitayama, 1994), they would ascribe high value to
what their in-group members consider important. Korean students,
presumably with stronger interdependent self-construal than
students in Western cultures, may be more likely to approve the
goals and standards valued by parents and teachers and internalize
them as their own. This conjecture should be formally tested in
future research, however, as the mean score of socially prescribed 
perfectionism in this study was not particularly high. 

In previous research, self-oriented perfectionism has frequently 
demonstrated both positive and negative characteristics, correlat- 
ing positively with variables as varied as depression, test anxiety, 
self-efficacy, and intrinsic and extrinsic motivation (Hewitt et al., 
2002; Mills & Blankstein, 2000; Stoeber et al., 2009). For the 
Korean adolescents participating in this study, in comparison, 
self-oriented perfectionism consistently emerged as a positive pre- 
dictor of adaptive variables, including academic self-efficacy and 
achievement, and a negative predictor of maladaptive variables, 
including acceptability of cheating and academic procrastination. 
Socially prescribed perfectionism primarily functioned as a posi- 
tive predictor of maladaptive variables. As we conjectured and 
consistent with Mills and Blankstein (2000), self-oriented perfec- 
tionism correlated positively with test anxiety, but when the co- 
variance with socially prescribed perfectionism was controlled for, 
the relationship was no longer significant. Self-oriented perfectionism
thus appears to be an adaptive characteristic, and socially
prescribed perfectionism a maladaptive one, in
the academic domain, at least for the current sample of Korean
middle school students.

Table 4
Standardized Total, Direct, and Indirect Effects in the English Model

The nature of direct paths from either type of perfectionism to 
subject-specific motivation was consistent across math and Eng- 
lish. Self-oriented perfectionism related positively to academic 
self-efficacy, a mastery goal, and a performance-approach goal, 
while socially prescribed perfectionism did not relate to academic 
self-efficacy and instead related positively to performance- 
approach and performance-avoidance goals. Consistent with our 
hypothesis, the relationships of perfectionism with outcome vari- 
ables were mediated by academic self-efficacy and achievement 
goals in the subject domain. The pattern of mediation also gener- 
ally stayed the same, regardless of whether motivation in math or 
that in English was examined. This is strong evidence that the present
findings are not confined to particular subject matter domains.

Even so, there was a difference in the extent to which motivation 
mediated the effects of perfectionism. All paths from self-oriented 
perfectionism to outcome variables were mediated, either fully or 
partially, by motivation variables. Those from socially prescribed 
perfectionism, in contrast, were not as effectively mediated by the 
same variables. Two of the three direct paths to outcome variables 
remained significant and strong, even after the subject-specific 
motivation variables entered the equation. The present results thus 
suggest that self-oriented perfectionism works largely through motivation
of learners in specific achievement contexts, whereas socially
prescribed perfectionism works more directly on achievement-related
outcomes.

It is worth noting that the four variables to which socially
prescribed perfectionism linked directly and consistently across
the two subject domains (test anxiety, academic procrastination,
a performance-approach goal, and a performance-avoidance
goal), as well as socially prescribed perfectionism itself, are
correlates of fear of failure (Elliot & Church, 1997; Speirs
Neumeister, 2004; Steel, 2007). This finding implies that fear of
failure may be particularly resistant to contextual influences. Bong 
(2001) offered a similar conjecture when performance-approach 
and performance-avoidance goals of Korean middle and high 
school students displayed noticeably stronger correlations across 
multiple subject matter domains than did other motivation con- 
structs. Compared to non-Asian students, Asian students also 
report stronger fear of failure, which explains their motivation in a 
specific achievement context better than do other constructs such 
as self-efficacy or effort attribution (Eaton & Dembo, 1997). 
Accordingly, it is possible that the relationships of socially pre- 
scribed perfectionism could be better mediated by motivation 
variables among non-Asian learners. 

Of the direct and indirect paths from perfectionism to maladap- 
tive outcomes included in this study, academic procrastination, in 
particular, clearly epitomizes the contrasting nature of the two 
perfectionism dimensions. As students expressed stronger self- 
oriented perfectionism, they were less likely to engage in academic 
procrastination. As they expressed stronger socially prescribed
perfectionism, in contrast, they more frequently engaged in
procrastinating behaviors. Socially prescribed perfectionism con- 
sistently demonstrates a strong positive correlation with fear of 
negative evaluation (Flett, Hewitt, & De Rosa, 1996; Hewitt & 
Flett, 1991). Also, compared to self-oriented perfectionism, which 
correlates positively with commitment to perfect “performance” at 
school, socially prescribed perfectionism correlates positively with 
commitment to perfect “relationships” with significant others 
(Flett et al., 1995). Students with socially prescribed perfection- 
ism, finding it difficult to satisfy the perfectionistic standards of 
others, could resort to academic procrastination as a desperate 
means to delay unfavorable judgments by, and damaged relation- 
ships with, parents and teachers. 

Milgram, Sroloff, and Rosenbaum (1988) provided yet another 
interesting account of the strong perfectionism–procrastination
link. They argued that when students feel that parents, teachers, 
and other powerful adult figures impose certain tasks on them, they 
may show greater procrastination as an expression of covert neg- 
ativism. Covert negativism is a type of motivation that represents 
an indirect display of hostility and passive retaliation toward 
authority figures. Steel (2007) also discussed the possibility that
rebellious individuals, especially young adolescents, procrastinate
on tasks with externally imposed deadlines because they view
these tasks as highly aversive. Socially prescribed perfectionism,
by definition, refers to excessive strivings toward and
concerns about fulfilling the difficult standards imposed upon
individuals by significant others (Hewitt & Flett, 1991). Covert negativism
seems to explain well the stronger tendency among socially
prescribed perfectionists to procrastinate in academic situations, 
especially adolescents in collectivistic cultures who find it difficult 
to ignore the wishes of their parents. 


Academic Self-Efficacy as a Positive Amplifier of 
Self-Oriented Perfectionism 


We examined the role of perfectionism in concert with academic 
self-efficacy and achievement goals, arguably the two most prom- 
inent constructs in contemporary academic motivation research. 
Students’ self-efficacy beliefs, or subjective convictions to per- 
form successfully in the given subject domain (Bandura, 1997; 
Schunk, 1991), were particularly effective in augmenting the adap- 
tive aspects of self-oriented perfectionism. The positive direct path 
from self-oriented perfectionism to academic achievement and the 
negative direct paths to acceptability of cheating and academic 
procrastination were all significantly mediated by academic self- 
efficacy in the domain. 

More specifically, self-oriented perfectionism related to stronger 
academic self-efficacy in the subject domain, whether it was math 
or English. Stronger academic self-efficacy, in turn, related to a 
stronger mastery goal and a weaker performance-avoidance goal in 
the subject areas. It also related to less academic procrastination 
and acceptability of cheating among students, directly or indirectly 
through a stronger mastery goal. Most of all, academic self- 
efficacy was the strongest positive predictor of achievement in 
both math and English, a finding that is now clearly established in 
the academic motivation literature (Bong & Skaalvik, 2003; Mul- 
ton et al., 1991; Pajares, 1996; Schunk, 1991; Zimmerman, 2000). 


Whereas academic self-efficacy as a mediator reinforced the 
adaptive functions of self-oriented perfectionism, it did not intervene
between socially prescribed perfectionism and other variables.
Korean middle school students expressed stronger convictions
for successfully performing in the given subject domains as
they expressed a stronger desire to achieve highly difficult goals 
but only when those goals were set by themselves. The desire to 
satisfy difficult goals set forth by others (socially prescribed
perfectionism) did not demonstrate a systematic relationship with
academic self-efficacy. The goals and standards set by self- 
oriented perfectionists, though they may be excessively demand- 
ing, thus appear to instill a sense of agency and perceived control 
in the individuals, which results in stronger convictions in their 
own capabilities for successfully attaining desired outcomes in the 
given domains. 

Latham and Locke (1991) wrote, “Given sufficient ability, goal 
theory predicts a drop at high goal difficulty levels . . . if there is 
a large decrease in goal commitment” (p. 215). As discussed 
above, socially prescribed perfectionism correlated not with com- 
mitment to perfect performance at school but with commitment to 
perfect relationships with significant others (Flett et al., 1995). 
This suggests that socially prescribed perfectionists lack commit- 
ment to the goals of achieving the perfectionistic performance 
imposed on them. This lack of goal commitment would disrupt the 
relationships between socially prescribed perfectionism, self- 
efficacy, and performance. Whereas socially prescribed perfec- 
tionism does not appear to have direct implications for self- 
efficacy of students in the academic domain, it does appear to 
orient students toward particular types of achievement goals, 
which we discuss next. 


Types of Perfectionism as Antecedents of 
Achievement Goals 


As hypothesized, self-oriented perfectionism linked positively 
to a mastery goal and a performance-approach goal. Socially 
prescribed perfectionism linked positively to a performance- 
approach goal and a performance-avoidance goal. The same pat- 
tern emerged in both math and English. A commonality between 
self-oriented perfectionism and the two approach-oriented achieve- 
ment goals is having an achievement motive as an antecedent. Self- 
oriented perfectionism and a mastery goal also prompt standards- 
based, as opposed to comparison-based, competence evaluation. 
Socially prescribed perfectionism and the two performance-oriented 
goals have fear of failure as a common correlate. Competence ap- 
praisals in socially prescribed perfectionism are carried out against 
standards imposed by others, while those in performance goals are 
executed against criteria determined by others’ performance (Elliot 
& Church, 1997; Elliot & McGregor, 2001; Speirs Neumeister, 
2004; Van Yperen, 2006). It is hence not surprising that self- 
oriented perfectionists readily adopt a mastery goal, while socially 
prescribed perfectionists readily adopt performance goals in 
achievement situations. 

For socially prescribed perfectionists, “others” are more than 
simply a source of comparison standards. Fear of unfavorable 
evaluation from others is a well-established correlate of socially 
prescribed perfectionism (Flett et al., 1996; Hewitt & Flett, 1991). 
In achievement situations, such fear could translate into concerns 
about proving competence to and concealing incompetence from 
others. In this study, we assessed achievement goals with the items
used in a study by Elliot and McGregor (2001). The performance- 
approach goal items focused exclusively on normative competence 
without any reference to concerns for ability validation, while the 
performance-avoidance goal items retained components of ability 
validation (e.g., “My fear of performing poorly in this class is often 
what motivates me”). Socially prescribed perfectionism still cor- 
related strongly with both performance goals in both math and 
English. Had we used a different set of performance goal items that 
combines the normative competence and ability validation com- 
ponents (e.g., PALS; Midgley et al., 2000), even stronger relation- 
ships could have emerged between socially prescribed perfection- 
ism and the two performance goals. 

A recent debate in achievement goal research entails whether 
the concern for demonstrating and validating ability in front of 
others should be viewed as a legitimate and indispensable constituent
of performance goals (Elliot & Murayama, 2008; Grant &
Dweck, 2003). The present results do not speak directly to this 
question. Nevertheless, it is worth noting that students’ desire to
appear perfectly competent by satisfying the standards imposed on 
them by others significantly and consistently related to both 
performance-approach and performance-avoidance goals across 
two specific subject matter domains in this study. Further, the 
paths from socially prescribed perfectionism to performance- 
approach goals (Bs = .33 and .21) were comparable in strength
with those to performance-avoidance goals (Bs = .34 and .47).

If socially prescribed perfectionism is maladaptive, these results 
suggest that not only a performance-avoidance goal but also a 
performance-approach goal shares its maladaptive characteristics.
The positive correlations of performance-approach goals with 
performance-avoidance goals (rs = .63 and .59) and test anxiety in 
this study (rs = .50 and .42) support this conjecture. In fact, the 
potentially detrimental nature of a performance-approach goal, 
amidst its positive associations with performance indexes, has 
been repeatedly observed in previous studies with younger learn- 
ers. Korean elementary and middle school students, for example, 
do not distinguish between performance-approach and performance- 
avoidance goals (Bong, Woo, & Shin, 2013) and, even when they 
do, their performance-approach goals correlate significantly with 
test anxiety (Bong, 2009) and predict help-seeking avoidance 
(Bong, 2008). Continued research on the makeup and function of 
the performance goal, therefore, seems warranted. 


Limitations 


Several limitations of the present investigation should be noted. 
First, we assumed certain causal predominance among the vari- 
ables included in our model according to theory and previous 
research. However, we measured the variables concurrently, ex- 
cept for the achievement scores that were collected after the 
surveys. A more accurate test of the mediating processes requires 
that presumed antecedents and consequents be assessed with a 
sufficient temporal interval. 

Second, while we assessed motivation and achievement vari- 
ables in reference to specific subject matter areas, outcome vari- 
ables such as test anxiety, acceptability of cheating, and academic 
procrastination were assessed in reference to general learning 
situations. Because one of our primary interests was direct
relationships between the two perfectionism dimensions and key
outcome variables, and because a difference in assessment specificity
between constructs could hamper proper examination of their 
associations (Pajares & Miller, 1995), we decided to assess the 
outcome variables at a level most similar to that of perfectionism. 
Evidence that anxiety (Green, Martin, & Marsh, 2007), cheating 
(Burton, 1963), and procrastination (Milgram, Mey-Tal, & Levi- 
son, 1998) display strong cross-situational consistency aided our 
decision (but see Goetz, Frenzel, Pekrun, & Hall, 2006). Never- 
theless, assessing all variables in the context of specific subject 
domains could disclose an interesting idiosyncrasy associated with 
subject matter learning, which we might have overlooked in this 
investigation. 

Third, we used the Multidimensional Perfectionism Scale 
(MPS) by Hewitt and Flett (1991). This scale is by far the most 
frequently used one in the literature, and it has also been success- 
fully translated and validated for Korean students in previous 
research (Seo & Synn, 2006). However, had we used a newer 
version specifically developed for the younger population, the 
Child-Adolescent Perfectionism Scale (CAPS; Flett, Hewitt, 
Boucher, Davidson, & Munro, 1997), the results could have been 
more accurate. 

Fourth, the acceptability of cheating and academic procrastina- 
tion scales demonstrated internal consistency estimates that were 
less than satisfactory. We were not too seriously concerned about 
the low reliability of these scales because (a) the acceptability of 
cheating scale had been associated with similar estimates of inter- 
nal consistency in previous studies (Anderman et al., 1998; Mur- 
dock et al., 2004), (b) we only used reliable portions of the 
variance in the analysis, and (c) the relationships of the variables 
assessed with these scales with other variables were consistent 
with theory and previous findings. Nonetheless, the low reliability 
of these scales could have compromised the integrity of the present
findings to a certain degree. 


Contributions and Future Directions 


Although perfectionism is a personality trait with strong moti- 
vational implications (Hewitt & Flett, 1991), only a limited num- 
ber of studies to date have directly investigated how perfectionism 
relates to motivation and achievement in the academic domain (see 
Fletcher & Speirs Neumeister, 2012). Further, a majority of these 
studies stop at reporting correlations between perfectionism and 
other variables, without probing the potentially intricate mediation 
or moderation in their associations (Verner-Filion & Gaudreau, 
2010). The present research fills this gap in the literature and 
documents the relevance of multidimensional perfectionism for
adolescents in achievement settings. Korean adolescents as young as
middle school students differentiated self-oriented and socially 
prescribed dimensions of perfectionism. They also manifested a 
distinct pattern of motivation and achievement-related behavior, 
depending on the particular type of perfectionism they possessed. 
When distinguished from socially prescribed perfectionism, self- 
oriented perfectionism was more facilitative than disruptive for 
motivation and learning processes. The current investigation con- 
tributes to the literature by supporting the dimensional analysis of 
perfectionism and adding to the growing body of literature that 
suggests the relatively adaptive nature of self-oriented perfection- 
ism (Stoeber et al., 2009). 

More important, this study has demonstrated that motivation in
specific academic contexts mediates the paths linking stable per- 
sonality dispositions such as perfectionism to concrete affective, 
behavioral, and performance outcomes in the academic domain. In 
previous research on the role of perfectionism in students’ aca- 
demic functioning, investigators have typically examined the di- 
rect associations between perfectionism and outcomes without 
considering the intervening motivational processes (Bieling et al., 
2003). When motivation variables were included, they were often 
assessed as motivation for school learning in general, and not as 
domain-specific motivational beliefs (Stoeber & Rambow, 2007; 
Verner-Filion & Gaudreau, 2010). However, such direct-path-only 
or general models would be an oversimplification of the complex 
interrelations among perfectionism, motivation, and outcomes. We 
tried to delineate part of this complexity by assessing some of the 
representative motivational constructs in reference to specific sub- 
ject domains. By doing so, we were able to demonstrate that the 
manner in which each perfectionism dimension links to various
outcomes depends, to a considerable degree, on students’ self- 
efficacy beliefs and achievement goals in particular subject matter 
areas. We believe this is an important finding because, even if we 
cannot easily change the perfectionism in students, we can still 
improve the quality of the learning process they engage in by 
altering their domain-specific motivational beliefs. 

Can we say, based on the present findings, that self-oriented 
perfectionism is truly an adaptive personality trait for students’ 
academic functioning? The answer to this question depends on 
several conditions. Most of all, although self-oriented perfection- 
ism played a positive role in this study, it is important to remember 
that it is not a pure form of achievement motivation that mutually 
excludes fear of failure or fear of negative evaluation from others 
(Flett et al., 1996; Frost et al., 1990; Hewitt & Flett, 1991; Speirs 
Neumeister, 2004). On the one hand, it represents the relentless 
propensity to demand a lot from oneself by setting high goals, 
which typically promotes intrinsic motivation, self-efficacy, effort, 
and persistence for attaining those goals. Under normal 
achievement situations where perceived stress or evaluative 
threat is not extreme, self-oriented perfectionism will activate 
approach-oriented motivation. On the other hand, it could turn 
maladaptive and function more similarly to socially prescribed 
perfectionism under extremely stressful situations. 

When Hewitt et al. (2002) divided children into three groups by 
levels of achievement stress, for example, it was only the children 
with high and average levels of achievement or social stress for 
whom stronger self-oriented perfectionism resulted in greater de- 
pression and anxiety. For those children with low levels of 
achievement or social stress, self-oriented perfectionism did not 
show a significant relationship with any of these variables. The 
relationships of socially prescribed perfectionism with maladjust- 
ment symptoms did not depend on levels of stress. When individ- 
uals are under high stress, perceive strong evaluative threat, or 
need to perform high-stakes tasks that are of great importance, 
self-oriented perfectionism correlates with negative affect, depres- 
sion, and anxiety (Frost & Marten, 1990; Hewitt, Mittelstaedt, & 
Wollert, 1989; Stoeber & Rambow, 2007), just like socially pre- 
scribed perfectionism does. We thus suggest that researchers and 
practitioners exercise due caution when interpreting findings re- 
lated to multidimensional perfectionism, taking into account the 
known individual and situational moderators of perfectionism. 


Finally, we suggest that socialization history and resultant dis- 
positional characteristics may present another common ground on 
which different dimensions of perfectionism could link to specific 
motivational beliefs such as academic self-efficacy and achieve- 
ment goals. Elliot and McGregor (2001) reported that person- 
focused negative feedback and conditional approval of mothers 
were antecedents of a performance-avoidance goal. Mothers’ con- 
ditional approval was an antecedent of a performance-approach 
goal as well. Hollender (1965) described a similar socialization 
mechanism spawning perfectionism by stating, “Perfectionism 
most commonly develops in an insecure child who needs approval, 
acceptance and affection from parents who are difficult to please” 
(p. 103). Speirs Neumeister and Finch (2006) demonstrated that 
insecure attachment to parents was indeed an antecedent of both 
perfectionism dimensions, while others showed that socially pre- 
scribed perfectionism usually demonstrates considerably stronger 
correlations with parent-related variables such as parental expec- 
tations and parental criticism (Flett et al., 1995). Future research 
should explore the causal chain among socialization history, mo- 
tive dispositions, and perfectionism of the children and their mo- 
tivation in school. 

Effortful control has been proposed as a set of neurocognitive competencies that is relevant to self-regulation
and educational attainment (Posner & Rothbart, 2007). This study tested the hypothesis that a multiagent 
report of adolescents’ effortful control (age 17) would be predictive of academic persistence and educational 
attainment (age 23-25), after controlling for other established predictors (family factors, problem behavior, 
grade-point average, and substance use). Participants were 997 students recruited in 6th grade from 3 urban 
public middle schools (53% males; 42.4% European American; 29.2% African American). Consistent with 
the hypothesis, the unique association of effortful control with future educational attainment was comparable 
in strength to that of parental education and students’ past grade-point average, suggesting that effortful control 
contributes to this outcome above and beyond well-established predictors. Path coefficients were equivalent 
across gender and ethnicity (European Americans and African Americans). Effortful control appears to be a 
core feature of the self-regulatory competencies associated with achievement of educational success in early 
adulthood. These findings suggest that the promotion of self-regulation in general and effortful control in 
particular may be an important focus not only for resilience to stress and avoidance of problem behavior but 
also for growth in academic competence. 


Keywords: educational attainment level, self-regulation, academic achievement, adolescence, family background


Education success and attainment is the clearest index of compe- 
tence and success in modern Western societies. At the individual 
level, higher educational attainment predicts quality of life throughout 
adulthood, including employment status, income, psychological and 
physical health, well-being, and community involvement (Adams, 
2002; Day & Newburger, 2002; Herzog, Franks, Markus, & Holmberg,
1998; Karvonen et al., 2007; McCaul, Donaldson, Coladarci, &
Davis, 1992; Ross & Mirowsky, 2006; Tobiasz-Adamczyk, Bartoszewska,
Brzyski, & Kopacz, 2007; Zhang, Huang, Ye, & Zeng,
2008). From a societal perspective, it is necessary to promote higher 
rates of secondary school completion, postsecondary technical train- 
ing, and college and graduate training to meet current socioeconomic 
and demographic challenges. These challenges include an aging 
workforce, which requires training of replacement workers, the fast 
pace of technological progress, and market globalization (Organisa- 
tion for Economic Co-Operation and Development, 2005). During 
recent decades, researchers have identified many correlates of stu- 
dents’ educational attainment, but high rates of school dropout and 
low attendance of postsecondary education programs still represent 
significant costs to industrialized countries, including the United 
States (Belfield, Levin, & Brookings, 2007) and Canada (Kirby, 
2009). Thus, key targets must be identified for future intervention 
efforts aiming to help students persevere through their formal school- 
ing. The main objective of this study was to examine the role of 
effortful control, an understudied yet promising predictor of school 
persistence, and to determine whether this predictor remains impor- 
tant after other known predictors of educational attainment are ac- 
counted for. 


Predictors of Educational Attainment 


Many aspects of students’ family background and individual 
characteristics have been studied in the search for significant
predictors of educational attainment. Family socioeconomic status
(SES) and family processes are two major predictive family char- 
acteristics that have been examined in relationship to children’s 
educational progression. 

Family SES is a multifaceted concept that affects children’s 
long-term educational outcomes in at least two ways. First, paren- 
tal education plays an important role in children’s educational 
progression. Parents with higher levels of education are more 
likely to encourage their children to pursue higher education and to 
have the resources to support this endeavor. As such, parents’ level 
of educational attainment is a strong and consistent predictor of 
students’ academic persistence as measured in early and middle 
adulthood (Dubow, Boxer, & Huesmann, 2009; Hardy et al., 1997; 
King, Meehan, Trim, & Chassin, 2006; Kristensen, Gravseth, & 
Bjerkedal, 2009; Marjoribanks, 2005; Taylor, Hurd, Seltzer, 
Greenberg, & Floyd, 2010), even after controlling for other sig- 
nificant indicators of family SES, including the value or ownership 
of their housing, family income, and the prestige of parents’ 
occupation (Albrecht & Albrecht, 2011; Dubow et al., 2009; 
Kristensen et al., 2009; Melby, Conger, Fang, Wickrama, & Con- 
ger, 2008; South, Baumer, & Lutz, 2003; Taylor et al., 2010). A 
second implication of family SES is the degree to which it relates 
to family stress, instability, and neighborhood integration. Low- 
SES families tend to have a host of risk factors associated with 
elevated levels of family stress and poorer community integration 
(Albrecht & Albrecht, 2011; Melby et al., 2008; Ou, 2005; South 
et al., 2003; Taylor et al., 2010); risk factors may include frequent 
residential transitions, having young parents, or living in a single 
or unmarried household, all of which are related to lower educa- 
tional attainment. 

Family process factors also play a valuable role in children’s 
educational attainment. Parents who have overly negative interac- 
tions with their children or who have personal problems that 
undermine effective parenting (e.g., couple issues) can impede 
their child’s persistence in school (Dubow et al., 2009; King et al., 
2006). Conversely, children whose parents are involved in their 
education, have a supportive parenting style, or hold high expec- 
tations for their educational attainment tend to stay in school 
longer (Ou, 2005; Pettit, Yu, Dodge, & Bates, 2009; Taylor et al., 
2010). Robertson and Reynolds (2010) looked at the global influ- 
ence of favorable family context by assigning students to clusters 
based on measures of demographic variables (e.g., mother age and 
education, number of adults living in the home, parental employ- 
ment, subsidized meals) and of parenting (e.g., child maltreatment, 
parental involvement, parental expectations). Four clusters were 
found to be internally consistent in terms of human capital re- 
sources (based on demographic data) and family functioning. As 
predicted, children belonging to clusters that had higher levels of 
resources and high-quality parenting reached higher levels of ed- 
ucational attainment. 

Numerous student characteristics have also been evaluated as 
predictors of future educational attainment, and they can be clas- 
sified as risk or compensatory factors. Risk factors include pre- 
dictors of poor academic adjustment, which can precipitate drop- 
out or discourage involvement in higher education. Youth 
externalizing problems, especially when documented in childhood 
or early adolescence, have often been identified as predictors of 
lower educational attainment (King et al., 2006; McLeod & Kaiser, 
2004; Pettit et al., 2009). Substance use later in adolescence also
has been consistently linked with poorer school persistence
(Chatterji, 2006; Hardy et al., 1997; King et al., 2006; Ryan, 2010).

Compensatory factors that help facilitate progression through 
the education system have also been identified. They include 
students’ educational aspiration and academic success (often as- 
sessed using grade-point average [GPA], standardized test scores, 
inclusion on the honor roll, avoidance of grade retention), which 
are strong and reliable predictors of educational attainment (Al- 
brecht & Albrecht, 2011; Ganzach, 2000; Hardy et al., 1997; King 
et al., 2006; Marjoribanks, 2005; Mello, 2008; Ou, 2005; Pettit et 
al., 2009; South et al., 2003). Cognitive functioning, such as 
childhood IQ or general cognitive ability in early adulthood 
(Dubow et al., 2009; Kristensen et al., 2009), and positive psy- 
chological dispositions, including positive academic self-concept, 
academic engagement, future orientation, and positive tempera- 
mental dispositions (Beal & Crockett, 2010; Hampson, Goldberg, 
Vogt, & Dubanoski, 2007; Marsh & O’Mara, 2008; Melby et al.,
2008), are also indicative of future educational attainment. 

The extensive literature describing established risk and compen- 
satory factors for educational attainment makes it possible to 
identify with considerable confidence students who are at high risk 
for leaving school before they obtain an adequate level of educa- 
tional training. Because so many of these factors are difficult to 
alter, it is essential to identify student or parent characteristics that 
are amenable to change so that interventions can be developed to 
effectively bolster student retention, reduce dropout, and ulti- 
mately promote educational attainment (Rumberger, 1987). In an 
effort to help determine new predictors that have stronger
implications for intervention research, we aimed in this study to test
effortful control as a predictor of educational attainment by age 23.


Effortful Control 


Effortful control is an aspect of temperament that reflects self- 
regulatory skill. Effortful control involves the ability to inhibit 
impulses and prevent disruptive behaviors (inhibitory control), to 
focus and maintain attention despite distractions (attention con- 
trol), and to initiate and complete tasks that have long-term value, 
even when they are unpleasant (activation control; Rothbart & 
Bates, 1998). 

Effortful control is heritable and shows moderate stability over 
time, but its development is also shaped by experience (Eisenberg 
et al., 2005; Goldsmith, Buss, & Lemery, 1997). Experimental 
studies have shown that aspects of effortful control can be im- 
proved in children, adolescents, and adults by a range of interven- 
tions, including mindfulness training (Sahdra et al., 2011; Tang et 
al., 2007), self-control exercises (Muraven, 2010), parent training 
(Somech & Elizur, 2012; Stormshak, Fosco, & Dishion, 2010), and 
school-based interventions (Diamond, Barnett, Thomas, & Munro, 
2007; Raver et al., 2011). 

A growing literature reveals that effortful control predicts aca- 
demic success in children and adolescents, even after controlling 
for prior academic performance or general cognitive ability (Allan 
& Lonigan, 2011; Blair & Razza, 2007; Checa, Rodríguez-Bailón,
& Rueda, 2008; Checa & Rueda, 2011; Valiente, Lemery- 
Chalfant, & Swanson, 2010; Valiente, Lemery-Chalfant, Swanson, 
& Reiser, 2008; Zhou, Main, & Wang, 2010). Posner and Rothbart 
(2007) have proposed that understanding the neurocognitive fea- 
tures of effortful control, its malleability, and its role in the growth 
of competence in children is perhaps the most important agenda
item for future research in education sciences. In fact, Posner and 
Rothbart propose that we should consider educating the human 
brain as much as teaching traditional content domains, such as 
reading, writing, and math. They contend that developing the 
neurocognitive skill of effortful control will benefit growth in 
general cognitive competence as much as in domain-specific skills. 
Although this idea is intriguing, relatively little research has ex- 
amined it in general, let alone specific to adolescence and young 
adulthood. This omission is noteworthy in that adolescence is a 
turning point for many youths, at which time some disengage from 
academics and others persist into higher levels of educational 
attainment. 

In our study, we extended findings about effortful control and 
academic success in childhood and adolescence by examining the 
relationship between effortful control and educational attainment 
in young adulthood. Effortful control may play a particularly 
important role in the pursuit and successful completion of post- 
secondary education. In comparison with earlier years of school- 
ing, postsecondary education has unique qualities that make self- 
regulation especially important. Not only is postsecondary 
education voluntary, it also occurs within the developmental con- 
text of increasing freedom and responsibilities (Arnett, 2000). It 
requires that students manage the demands related to completion 
of their coursework and degree programs (time management, 
course selection, completion of long-term projects) in a context 
that provides less support and structure than is common in earlier 
levels of education. In addition, students are faced with the chal- 
lenges of balancing the demands of their education with an ex- 
panding array of competing options and responsibilities that arise 
in emerging adulthood. Thus, it is expected that higher levels of 
effortful control will promote the planfulness that is involved in 
choosing to pursue higher education and the self-management that 
is required to successfully complete a degree. Consistent with this 
perspective, evidence is emerging that links school persistence and 
aspects of effortful control. For example, a recent study by An- 
dersson and Bergman (2011) revealed that task persistence at age 
13 was a statistically significant, albeit modest, predictor of edu- 
cational attainment 30 years later. In addition, Wolfe and Johnson 
(1995) found that in predicting college GPA, self-discipline out- 
performed SAT standardized assessment scores. Although this 
preliminary research is promising, an important research goal is to 
determine whether effortful control predicts educational attain- 
ment. 


This Study 


The aim of this study was to evaluate the role of effortful control 
in the progression toward higher levels of educational attainment 
in early adulthood. Because of policy and intervention implications 
of this study, we controlled for many of the family and individual 
variables that have historically predicted educational attainment so 
that we could conduct a more stringent test of the unique contri- 
bution of effortful control to educational attainment. Specifically, 
we controlled for key family processes, such as relationship quality 
and effective parenting practices, adolescent problem behavior 
during middle school, adolescent substance use and GPA during 
high school, and sociodemographic factors (family SES and
parental education). We hypothesized that effortful control would be
a significant predictor of educational attainment, above and
beyond established predictors.

Effortful control was assessed using parent, teacher, and ado- 
lescent self-report methods to create a multi-informant latent con- 
struct to ensure strong measurement of this focal construct in our 
study. Furthermore, we used a 12-year longitudinal design to 
represent the hypothesized sequence of action of different predic- 
tors and to avoid the inflated correlations that occur when predic- 
tors and outcomes are measured simultaneously. A secondary goal 
of this study was to verify whether our prediction model could 
generalize to students of both genders and to students of various 
ethnic groups. 

To achieve these goals, we used structural equation modeling 
(SEM) to test the model presented in Figure 1. The hypothesized 
sequence of action of various predictors reflects the sensitive 
periods identified in the studies cited earlier in this article in 
relation to family situation, early adolescence problem behavior, 
substance use, and school adjustment as predictors of educational 
attainment. Positive family involvement and problem behavior are 
hypothesized to play an important role in early adolescence and to 
predict more proximal predictors of educational attainment, 
namely, substance use, high school cumulative GPA (CGPA), and 
effortful control in late adolescence. The possibility that early 
predictors are residually related to educational attainment about 10 
years later is indicated by direct paths from early adolescence 
predictors to the outcome measure. To keep Figure 1 simple, we
did not depict residual correlations among predictors measured 
during the same developmental period, but they were included in 
the statistical model (i.e., problem behavior was correlated with 
positive family involvement; substance use, high school CGPA, 
and effortful control were intercorrelated; and family SES and 
parental education were correlated with each other and with the 
five other predictors in the model). Our primary analyses were 
conducted on the entire sample, and we tested the generalizability 
of our findings to various subgroups by using multiple-group 
analyses. 
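
The path structure described above can be written down as a small directed graph. The following is a minimal sketch, not the authors' SEM specification: the variable labels are hypothetical, and it assumes that each early-adolescence predictor has a path to each of the three proximal predictors. Family SES and parental education are left out of the directed paths because this paragraph states only their correlations with the other predictors.

```python
# Hypothetical labels for the constructs named in the text; these are
# illustrative names, not identifiers from the original study.
EARLY = ["positive_family_involvement", "problem_behavior"]
PROXIMAL = ["substance_use", "high_school_cgpa", "effortful_control"]
OUTCOME = "educational_attainment"


def build_paths():
    """Return the directed paths (from, to) described in the paragraph."""
    paths = []
    for early in EARLY:
        for proximal in PROXIMAL:
            # Assumption: every early predictor points to every proximal one.
            paths.append((early, proximal))
        # Residual direct path from the early predictor to the outcome.
        paths.append((early, OUTCOME))
    for proximal in PROXIMAL:
        # Proximal predictors point to the outcome.
        paths.append((proximal, OUTCOME))
    return paths


def indirect_routes(paths, source, target):
    """List mediators m such that source -> m -> target are both paths."""
    path_set = set(paths)
    mediators = {b for (a, b) in path_set if a == source}
    return sorted(m for m in mediators if (m, target) in path_set)


paths = build_paths()
print(len(paths))
print(indirect_routes(paths, "problem_behavior", OUTCOME))
```

Under these assumptions each early predictor reaches the outcome both directly and through each of the three proximal predictors, which is the pattern of direct and mediated paths the paragraph describes.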


Method 


Participants 


Participants were 997 adolescents and their families from the 
Project Alliance 1 study recruited in Grade 6 from three public
middle schools in an ethnically diverse metropolitan community in 
the northwestern United States. Parents of all Grade 6 students in 
two cohorts (years 1996 and 1998) were approached for partici- 
pation, and 90% consented. The participating sample included 526 
males (52.8%) and 471 females (47.2%). By youth self-report, the 
sample comprised 423 European Americans (42.4%), 291 African 
Americans (29.2%), 68 Latinos (6.8%), 52 Asian Americans 
(5.2%), and 164 (16.4%) youths of other ethnicities, including 
mixed ethnicity. Parent reports collected when the adolescents 
were 16 years old revealed that 39.6% of participants lived with 
both genetic parents, 43.8% lived with their biological mother, 
6.7% lived with their biological father, and 10.0% lived in other 
family configurations. The median range of gross annual house- 
hold income was $30,000-$39,999, with 25.3% of households 
earning less than $20,000 per year and 12.7% earning more than 
$90,000. 

Figure 1. [...] also included in the model, even if they are not depicted here. Correlations between family socioeconomic status 
(SES) and parental education, and between these variables and the five other predictors, were included. CGPA = 
cumulative grade-point average. 


Because most participants remained in the same middle school 
from Grade 6 through Grade 8, and because data collection took 
place in the school setting, a high rate of retention was maintained 
across the first three time points. Most participants were streamed 
into a few local high schools whose principals agreed to help us 
track participants, which greatly facilitated data collection in 
Grades 9 and 11. These procedures, however, were not sufficient 
for participants who stopped attending the schools involved in our 
study and were not useful after participants graduated from high 
school. Additional procedures were therefore put in place; namely, 
at each time point, participants were asked to fill out a form with 
their current contact information (mailing address, phone numbers) 
and to provide the contact information of other people (e.g., 
friends, family members) who could help us find them if they had 
moved before the next time point of our data collection. Partici- 
pants were also paid $5 for sending us their new contact informa- 
tion when they moved. Under those circumstances, questionnaires 
that were usually filled out in school could be filled out at home 
and mailed back to us. Together, these longitudinal retention 
procedures were very effective, with approximately 80% of youths 
being retained across the study span. 

One half of the study sample was randomly assigned to a 
multilevel family-centered intervention, ECOfit (Dishion & Kavanagh, 2003; 
Dishion & Stormshak, 2007), which aimed at preventing substance 
use and problem behavior in adolescents. Intent-to-treat analyses 
revealed positive intervention effects in relation to substance use 
(Connell, Dishion, & Deater-Deckard, 2006), antisocial behavior 
(Van Ryzin & Dishion, 2012), and the probability of police arrest 
(Connell, Klostermann, & Dishion, 2012). In addition, using complier 
average causal effect analyses to assess the impact of families’ 
engagement in the selected level of this intervention (the 
Family Check-Up), we found significant intervention effects on 
substance use, problem behavior, school grades, and attendance 
during middle and high school (Connell, Dishion, Yasui, & Ka- 
vanagh, 2007; Stormshak, Connell, & Dishion, 2009; Véronneau, 
Dishion, & Connell, 2013). Because improving educational attain- 
ment was not a goal of this program and because traditional 
intent-to-treat effects were not found for academic outcomes in 
middle and high school, we did not expect major differences in the 
covariance matrices of the intervention and control groups based 
on the variables of interest in this study. To verify this assump- 
tion, we used participants’ raw data while testing for equiva- 
lence of the unconstrained covariance matrices for the treatment 
and control groups and found good model fit for most, but not 
all indices, χ²(76) = 110.02, p < .01, root-mean-square error of 
approximation (RMSEA) = .03, comparative fit index (CFI) = 
.98, Tucker-Lewis Index (TLI) = .97. The chi-square test 
suggests that we should reject the null hypothesis that 
the treatment and control groups have equivalent covariance 
matrices; in contrast, all other fit indices suggest that constraining 
the covariance matrices of the two groups to be equal yields a 
well-fitting model, with both CFI and TLI > .95 and RMSEA < .06 
(Hu & Bentler, 1999). Because the chi-square test may be 
overly sensitive to trivial group differences when large sample 
sizes are used (as is the case in this study), we prioritized the 
other fit indices and concluded that the two groups did not 
differ with regard to the covariance of our study’s variables. 
Therefore, data from the two groups were pooled in this study’s 
analyses. 


Assessment Procedures 


School-based self-report assessments of problem behavior and 
family involvement were collected from students in Grades 6 
through 8 using an adaptation of a survey instrument developed 
and reported by scientists at the Oregon Research Institute for the 
Community Action for Successful Youth project (Metzler, Biglan, 
Ary, & Li, 1998). In Grade 11, a larger assessment protocol was 
conducted that included additional student self-report surveys, 
teacher ratings that were administered in the high school setting, 
and parent report questionnaires that were completed at home and 
mailed to our research office. This Grade 11 assessment was the 
final school-based assessment. After high school, subsequent as- 
sessments were conducted when participants were approximately 
19 years old and again when they were approximately 23 years old. 
The age 19 assessment was limited in scope and did not pertain to 
the current study. However, age 23 questionnaires captured con- 
structs of interest to our study. At this wave, questionnaires were 
sent directly to participants’ homes and were returned to our 
research office by mail. All respondents were assured of the 
confidentiality of their responses. Participants, parents, and teach- 
ers were compensated for their participation. 


Measures 


Family SES. SES was measured by parent report of their 
employment status, income, housing status, and financial aid to the 
family. For employment status, we used the highest score based on 
reports from both primary caregivers when participants were from 
two-parent families (full time or self-employed [coded 4]; part time 
[3]; seasonal [2]; disabled, unemployed, temporary layoff, homemaker, 
retired, or student [1]). One global score was used for each 
of the other indicators: family housing (own your home [coded 5], 
rent your home [4], motel/temporary [3], live with a friend or live 
with a relative [2], and emergency shelter or homeless [1]); 
household income ($90K or more [coded 7], between $70K and $90K 
[6], between $50K and $70K [5], between $30K and $50K [4], 
between $20K and $30K [3], between $10K and $20K [2], and less 
than $10K [1]); and financial aid (sum of dichotomous indicators 
of whether the family received food stamps, Aid to Families with 
Dependent Children, other welfare, medical assistance, and Social 
Security death benefits, reverse coded). These variables were 
standardized and averaged (α = .75). In this study, SES information 
was not collected from youths because of concern that it would 
potentially be unreliable information. The Grade 11 data collection 
was the first time point when all parents were surveyed, and thus 
this is the earliest wave of SES data for the overall sample. SES 
was not assigned to a specific developmental period in the model 
and was treated as a fixed variable that other predictors from any 
time point could be correlated with. 
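
The scoring of the SES composite described above can be sketched as follows. This is a minimal illustration, not the authors' scoring script: the five families and their coded values are hypothetical, the helper names (`zscores`, `cronbach_alpha`) are ours, and the choice of the population standard deviation for standardizing is an assumption that the text does not specify.

```python
from statistics import mean, pstdev, variance

def zscores(values):
    """Standardize a list of numbers to mean 0 and SD 1 (population SD)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns of equal length."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]      # per-family sum of items
    item_var = sum(variance(col) for col in items)  # sum of item variances
    return k / (k - 1) * (1 - item_var / variance(totals))

# Hypothetical coded indicators for five families (not study data).
employment = [4, 3, 1, 4, 2]   # highest caregiver score, coded 1-4
housing    = [5, 4, 2, 5, 4]   # coded 1-5
income     = [6, 4, 1, 7, 3]   # coded 1-7
aid        = [5, 4, 1, 5, 3]   # reverse-coded count of aid sources

# Standardize each indicator, then average across indicators per family.
indicators = [zscores(col) for col in (employment, housing, income, aid)]
ses = [mean(row) for row in zip(*indicators)]
alpha = cronbach_alpha(indicators)
```

Because every indicator is standardized before averaging, indicators coded on different ranges (1 to 4, 1 to 5, 1 to 7) contribute equally to the composite.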

Parental education. At the Grade 11 assessment, caregivers 
reported on the highest level of education that they themselves had 
achieved: graduate degree or college degree (coded 5), junior 
college or partial college (4), high school graduate (3), partial 
high school or junior high completed (2), and 7th grade or less or 
no formal schooling (1). When data were provided for two primary 
caregivers, we used the highest of the two scores. 

Positive family involvement. This latent variable was measured 
from a combination of three youth report indicators. For each 
of these three indicators, an average score based on data collected 
in Grades 6, 7, and 8 was computed as a reliable index for the 
entire middle-school period. The first indicator, positive family 
relations, was based on a six-item scale that included statements 
such as “I really enjoyed being with my parents,” “My parents 
trusted my judgment,” and “Family members backed each other up.” 
Each item was scored on a scale ranging from 1 (never true) to 5 
(always true) within the past month, and a mean score was computed 
from the six items (αs for Grades 6 through 8 ranged from 
.89 to .90). The second indicator, parental monitoring, was based 
on a five-item scale that asked the youths how often their parents 
knew what they were doing away from home; where they were 
after school; what their plans were for the next day; and what 
their interests, activities, and whereabouts were. Each item was scored 
on a scale ranging from 1 (never or almost never) to 5 (always or 
almost always), and a mean score was created on the basis of all 
five items (αs for Grades 6 through 8 ranged from .85 to .87). The 
third indicator, homework rule, included one item that reflected 
whether parents had a rule about the child doing homework every 
day. The item was scored on a scale ranging from 1 (don’t have a 
rule or expectation) to 4 (have a clear rule). 

Problem behavior. Problem behavior was measured using a 
nine-item self-report scale administered in Grades 6, 7, and 8. The 
variable was created from an average score based on data collected 
at all three time points to create a reliable measure for the entire 
middle-school period. Sample items include “Stayed out all night 
without parents’ permission,” “Intentionally hit or threatened to hit 
someone at school,” and “Stole or tried to steal things worth more 
than $5.” Each item was rated on a scale ranging from 1 (never) to 
6 (more than 20 times), and the reference period was during the 
past month (αs at Grades 6 through 8 ranged from .77 to .84). 

Effortful control. The three indicators of the effortful control 
construct were administered in Grade 11: parent report, self-report, 
and teacher report. Parent and child reports were based on the 
Effortful Control scale from the short form of the Early Adolescent 
Temperament Questionnaire—Revised (EATQ-R; Ellis & Rothbart, 
2005). The EATQ-R Effortful Control scale consists of 16 items 
that assess activation control (the capacity to perform an action 
when there is a strong tendency to avoid it; e.g., “If I have a hard 
assignment to do, I get started right away”), attention (the capacity 
to focus attention as well as shift attention when desired, e.g., “It 
is really easy for me to really concentrate on homework prob- 
lems”), and inhibitory control (the capacity to plan and to suppress 
inappropriate responses, e.g., “I can stick with my plans and 
goals”). Each item was scored on a scale ranging from 1 (almost 
always untrue) to 5 (almost always true), with higher scores 
indicating greater effortful control. 

Previous work by Ellis and Rothbart (2001) reports evidence of 
the validity of the Effortful Control scale for a sample of adoles- 
cents ranging in age from 10 to 16. Their study demonstrated 
adequate internal consistency (α = .80 for the self-report, α = .87 
for the parent report) and acceptable convergence (r = .50) be- 
tween adolescent and parent report (Ellis, 2002). The self- and 
parent report versions include essentially the same items, with the 
pronouns changed appropriately. For the parent reports, partici- 
pants’ mothers, fathers, and other guardians could all complete the 
Effortful Control scale. When multiple caregivers responded, those 
answers were averaged into one parent report score. Internal con- 
sistency for the 16-item Effortful Control scale was .63 for youths, 
.17 for mothers, and .82 for fathers. 

The third indicator, teacher report of effortful control, consisted 
of five items with content similar to that of the EATQ-R Effortful 
Control scale (e.g., “thinks ahead of time about the consequences 
of actions,” “plans ahead before acting,” “pays attention to what he 
or she is doing,” “works toward goals,” and “sticks to what he or 
she is doing until it is finished, even with unpleasant tasks”). 
Teachers used a 5-point rating scale to describe how frequently 
each participant engaged in these behaviors. The internal consistency 
of the teacher report scale was α = .94. 

CGPA. Students’ academic records were gathered from the 
schools from Grade 9 through 11. If a participant moved to 
another school, we sought academic records from the new 
school as well. GPA was measured on a scale ranging from 0 to 
4, with higher scores reflecting better grades (F = 0, D = 1, 
C = 2, B = 3, A = 4). GPA was obtained at the end of each 
school year as the average grade across participants’ academic 
courses for that year. For youths who attended multiple schools 
during an academic year, an adjusted GPA was computed as the 
average of the available GPAs, weighted to reflect the propor- 
tion of the school year they represented. Our analyses used a 
CGPA measure computed as the average of all yearly GPA data 
available for Grades 9 through 11. For the cohort of participants 
who were originally enrolled in 1998 (about half the partici- 
pants), CGPA in Grade 11 was unavailable because of a change 
in the school district’s record-keeping system. Other students 
had missing GPA data because of school dropout or because 
they attended schools that were unable to provide official 
academic records. As a result, 47% of participants had a CGPA 
measure based on all 3 years of high school; 30% had a CGPA 
measure based on Grades 9 and 10, and 12% had a CGPA based 
on Grade 9 only, resulting in 89% of participants with valid 
GPA data for the main analyses. Correlations between CGPA 
and yearly GPA were .80 for Grade 11, .93 for Grade 10, and 
.93 for Grade 9 (all ps < .001). 
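
The two-step GPA computation described above (a weighted yearly GPA for students who changed schools, then an average over the available years) can be sketched as follows. The function names and the example student are hypothetical; renormalizing the weights when part of a year is undocumented is our assumption, because the text states only that GPAs were weighted by the proportion of the school year they represented.

```python
def adjusted_gpa(segments):
    """Weighted yearly GPA for a student who attended several schools.

    `segments` is a list of (gpa, fraction_of_year) pairs. Weights are
    renormalized so that only the available records count (assumption).
    """
    total_weight = sum(weight for _, weight in segments)
    return sum(gpa * weight for gpa, weight in segments) / total_weight

def cgpa(yearly_gpas):
    """Average all available yearly GPAs; None marks a missing year."""
    available = [g for g in yearly_gpas if g is not None]
    return sum(available) / len(available) if available else None

# Hypothetical student: Grade 9 split across two schools, Grade 11 missing.
grade9 = adjusted_gpa([(3.0, 0.25), (2.0, 0.75)])
grade10 = 2.75
student_cgpa = cgpa([grade9, grade10, None])
```

A student with no GPA in any year receives `None`, which corresponds to the participants who lacked valid GPA data for the main analyses.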

Substance use. Participants completed a survey in Grade 11 
that enabled us to measure the extent of their substance use. 
Participants reported on their use of tobacco, alcohol, mari- 
juana, and other drugs, and an average score for substance use 
was created. Participants were asked to report their frequency of 
use during the past 3 months for each substance, on a scale 
ranging from 0 (never) to 7 (2 or 3 times a day or more). “Other 
drugs” was defined for the participants as any of the following 
substances: heroin, morphine, cocaine or crack, speed or meth, 
ecstasy, angel dust or PCP, acid or LSD, mushrooms, gasoline, 
glue, other inhalants, and prescription medications for recre- 
ational use. 

Ethnicity. Although various ethnic groups were represented 
in this sample, only the two largest groups (European American 
and African American) could be used for ethnic comparison pur- 
poses, and we used youth report of their ethnicity. 

Educational attainment. Participants reported on the highest 
level of education they had completed as of the age 23 assessment. 
This information was coded on a 4-point scale: less than high 
school (coded 1), high school/GED (2), trade school/some college/ 
specialized training/2-year college degree (3), or 4-year college or 
graduate degree (4). This measure was treated as an ordered 
categorical variable in the primary analyses. 


Results 


Preliminary Analyses 


Missing data. For the variables included in our study, the 
mean percentage of missing data was 14% (range = 0%-33%). 
Little’s missing completely at random (MCAR) test was significant, 
χ²(361) = 505.54, p < .001, indicating that the data were not 
MCAR. We explored patterns of missingness based on the amount 
of missing data for different subgroups of participants by counting 
the number of variables for which there was a missing value for 
each participant. Then, we examined correlations between the total 
number of missing values for each participant and their scores on 
other measured (i.e., nonmissing) variables. 
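
The missingness check described above can be sketched as follows: count the missing values per participant, then correlate that count with an observed variable among the participants who have a score on it. The five records and the variable names are hypothetical and are not study data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical participant records; None marks a missing value.
records = [
    {"ses": 0.8, "cgpa": 3.6, "substance_use": 0.0, "monitoring": 4.5},
    {"ses": 0.1, "cgpa": 3.0, "substance_use": None, "monitoring": 4.0},
    {"ses": -0.4, "cgpa": None, "substance_use": None, "monitoring": 3.0},
    {"ses": -1.2, "cgpa": None, "substance_use": None, "monitoring": None},
    {"ses": 0.5, "cgpa": 3.2, "substance_use": 1.0, "monitoring": 4.2},
]

# Step 1: number of variables with a missing value, per participant.
n_missing = [sum(v is None for v in rec.values()) for rec in records]

# Step 2: correlate that count with an observed variable, using only
# participants who have a nonmissing score on the variable.
pairs = [(rec["ses"], m) for rec, m in zip(records, n_missing)
         if rec["ses"] is not None]
r_ses = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])
```

In this toy example the correlation is negative, meaning that lower SES goes with more missing values.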

Missing data were more common among male participants and 
among participants with lower educational attainment, lower 
CGPA, lower SES, lower parental education, lower parent-reported 
effortful control, less parental monitoring, and more substance 
use (rs = .08–.18, ps < .05). Missingness differed significantly 
across ethnic groups, F(2) = 4.66, p < .01. When 
comparing European Americans, African Americans, and other 
minority groups combined, a post hoc Scheffé test revealed that 
participants from other minority groups had more missing data 
than did European American participants (mean difference = 1.21; 
p = [illegible]). 

Covariance coverage was moderate to high, ranging from .59 to 
1.00. Full information maximum likelihood (FIML) was used 
within Mplus 7.0 to estimate parameters on the basis of all avail- 
able information from each participant. Consequently, participants 
with occasional missing data were retained in the analyses. FIML 
has been shown to be very efficient when analyzing data from 
samples with moderate levels of missing values, and it is adequate 
even when data are not MCAR, as long as the predictors of 
missingness are included in the model (Widaman, 2006). 

Descriptive statistics and correlations. Means, standard de- 
viations, and correlations among all measured variables are pre- 
sented in Table 1, along with the number of participants who 
provided valid data on each measure, and skewness and kurtosis 
values. Early problem behavior had a skew value greater than 2.0 
and a kurtosis value greater than 8.0 (cutoffs provided by Kline, 
2005) and was thus square-root transformed. The transformed 
variable did not have significant skew or kurtosis and was used in 
all subsequent analyses. All other variables were approximately 
normally distributed (skew < 2.0 and kurtosis < 8.0). As ex- 
pected, educational attainment had a strong positive correlation 
with CGPA; a moderate positive correlation with family SES, 
parental education, and effortful control according to the teacher, 
and a weaker but significant positive correlation with positive 
family relations, parental monitoring, homework rule, and both 
self-report and parent report of effortful control. Educational at- 
tainment had a weak but significant negative correlation with early 
adolescence problem behavior and late adolescence substance use. 
CGPA, measures of positive family involvement, parental educa- 
tion, and SES were negatively correlated with measures of prob- 
lem behavior and substance use. 
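
The screening and transformation step applied to early problem behavior can be sketched as follows. The scores are hypothetical, the moments are computed with simple population formulas, and two choices are our assumptions rather than details given in the text: kurtosis is expressed as excess kurtosis, and a variable is flagged when either cutoff is exceeded.

```python
from math import sqrt

def skew_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def exceeds_cutoffs(skew, kurtosis, skew_cut=2.0, kurt_cut=8.0):
    """Flag a variable whose skew or kurtosis exceeds the cutoffs in the text."""
    return abs(skew) > skew_cut or abs(kurtosis) > kurt_cut

# Hypothetical right-skewed scores: most youths report little problem behavior.
raw = [0] * 10 + [1] * 5 + [4] * 3 + [9, 16]
transformed = [sqrt(v) for v in raw]   # square-root transformation

raw_skew, raw_kurt = skew_and_kurtosis(raw)
new_skew, new_kurt = skew_and_kurtosis(transformed)
```

For these toy scores the raw variable is flagged and the square-root-transformed variable is not, which mirrors the decision reported above.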

Table 1 
Descriptive Statistics and Bivariate Correlations 

Group differences. Gender and ethnicity differences in all 
observed variables were examined with a series of one-way analyses 
of variance. Females had higher educational attainment, 
higher CGPAs, higher ratings on caregiver and teacher reports (but 
not self-reports) of effortful control, less early adolescence problem 
behavior, and more early adolescence parental monitoring 
than did males (all Fs > 10.0, ps = .001). 

Ethnic differences were obtained for all measures except 
caregiver-reported effortful control. African American participants 
had higher self-reported effortful control but lower teacher- 
reported effortful control relative to Caucasian participants. 
CGPA, parental education, SES, and parental monitoring were 
lower for African American participants, and homework rule and 
positive family involvement were higher for African American 
participants. African American participants reported more problem 
behavior in early adolescence but less substance use in late ado- 
lescence, relative to Caucasian participants (all Fs > 4.45, ps < 
.05). 


Primary Analyses 


Hypothesis testing proceeded in two steps: evaluation of the 
hypothesized model (see Figure 1) and examination of group 
differences (gender and ethnicity) in model fit. We evaluated the 
fit of the hypothesized model to the data using Mplus 7.0. SEMs 
were run using the mean- and variance-adjusted weighted least 
square estimator because the outcome variable (educational attain- 
ment) was ordered categorical. Therefore, parameter estimates for 
the predictors of educational attainment can be interpreted as 
probit regression coefficients. Residual errors were allowed to 
correlate for latent-variable indicators with shared measures (i.e., 
child- and parent-reported effortful control, both of which used the 
EATQ-R questionnaire) and/or shared reporters (i.e., child- 
reported indicators of effortful control and family involvement). 
The model was deemed to have adequate fit if the CFI was > .95, 
and the RMSEA was < .06 (Hu & Bentler, 1999). Good model fit 
is usually indicated by nonsignificant chi-square values, but because 
of the large size of our sample, this index of fit may be 
overly conservative (Schermelleh-Engel, Moosbrugger, & Müller, 
2003). In this situation, it is common practice to give priority to the 
other fit indices in model fit evaluation. 
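
To illustrate what it means to read the coefficients as probit regression coefficients, the sketch below converts a linear predictor into probabilities for the four ordered attainment categories. The thresholds are hypothetical, and treating the latent response as having unit residual variance is a simplifying assumption; the model estimated in Mplus may be parameterized differently. Only the coefficient of .33 for effortful control comes from the results reported in this article.

```python
from statistics import NormalDist

def category_probabilities(linear_predictor, thresholds):
    """P(y = k) for an ordered probit model with increasing thresholds."""
    phi = NormalDist().cdf
    # Cumulative probabilities P(y <= k); the last category closes at 1.
    cumulative = [phi(t - linear_predictor) for t in thresholds] + [1.0]
    probabilities, previous = [], 0.0
    for c in cumulative:
        probabilities.append(c - previous)
        previous = c
    return probabilities

thresholds = [-1.5, -0.3, 0.9]   # hypothetical cut points between 4 levels
beta_effortful_control = 0.33    # standardized coefficient from the text

# Youths one SD below versus one SD above the mean of effortful control,
# with all other predictors held at zero.
low = category_probabilities(beta_effortful_control * -1.0, thresholds)
high = category_probabilities(beta_effortful_control * 1.0, thresholds)
```

Under these assumptions, a youth higher in effortful control has a larger probability of falling in the top attainment category and a smaller probability of falling in the bottom one.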

To examine group differences, we ran a series of multiple-group 
analyses (for gender and ethnicity) and compared model fit for 
unconstrained models (all regression and correlation coefficients 
free to vary across groups) and constrained models (coefficients 
constrained to be equal across groups). Because of the large 
sample size, we used change (Δ) in CFI to test for the significance 
of differences in fit. Fit was considered to be significantly different 
if the change in CFI was .01 or greater (Cheung & Rensvold, 
2002). 
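
The decision rule for comparing constrained and unconstrained models can be written in a few lines. The function name and the CFI values in the example are hypothetical; only the .01 cutoff comes from the text.

```python
def fit_differs(cfi_unconstrained, cfi_constrained, threshold=0.01):
    """Return True when the change in CFI is .01 or greater."""
    # Round to three decimals so floating-point noise does not tip the rule.
    delta = round(abs(cfi_unconstrained - cfi_constrained), 3)
    return delta >= threshold

# Hypothetical CFI values for two pairs of nested multiple-group models.
small_change = fit_differs(0.962, 0.960)   # change of .002
large_change = fit_differs(0.960, 0.950)   # change of .010
```

A change of .002 falls below the cutoff, so the constrained model would be retained; a change of .010 reaches it.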

The hypothesized model provided a good fit to the data, 
χ²(29) = 116.18, p < .001, CFI = .96, RMSEA = .06. Standardized 
coefficients for regression paths and factor loadings are 
presented in Figure 2. There were three significant predictors of 
educational attainment: adolescent effortful control (β = .33, 
SE = .09), parental education (β = .29, SE = .04), and high school 
CGPA (β = .26, SE = .08). All three predictors had effect sizes in, 
or very close to, the moderate range. We built a 95% confidence 
interval around these coefficients to test the null hypothesis that 
these predictors were of equal strength, and we were unable to 
reject it. This suggests that adolescents with higher levels of 
effortful control at age 17 had higher levels of educational 
attainment by age 23, and the unique relation of effortful control with 
future educational attainment is comparable in strength to that of 
other well-established predictors. Other control variables used in 
this model were not statistically significant predictors of 
educational attainment, including family SES, problem behavior, and 
family involvement in early adolescence. Similarly, late adolescence 
substance use was not associated with educational attainment. 
These nonsignificant paths are omitted from Figure 2 for 
parsimony, but they were still present in the statistical model. 
Correlations that were modeled between residual errors because of 
shared measures or reporters were positive and significant (rs = 
.10–.28, ps < .01). The estimated correlation matrix for the latent 
variables in the model is presented in Table 2. Model-estimated 
residual correlations among variables that were measured within 
the same developmental period were identical to those reported in 
Table 2, except for the following: Family SES correlated 
significantly with substance use (r = .08, p < .05), CGPA (r = .35, p < 
.001), and effortful control (r = .15, p < .01); parental education 
correlated significantly with substance use (r = .10, p < .01), 
CGPA (r = .35, p < .001), and effortful control (r = .11, p < .05); 
substance use correlated significantly with CGPA (r = −.10, p < 
.05) and effortful control (r = −.22, p < .001); and CGPA 
correlated significantly with effortful control (r = .67, p < .001). 

Figure 2. Model results (regression paths and factor loadings). Coefficients are standardized. All solid paths 
are significant at p < .05 or smaller. Other regression paths mentioned in Figure 1 that are not depicted here were 
included in the structural equation modeling analyses but were not significant. SES = socioeconomic status; 
CGPA = cumulative grade-point average. 
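
The confidence-interval comparison of the three significant predictors can be reproduced approximately from the reported coefficients and standard errors. The sketch assumes normal-theory intervals (β ± 1.96 × SE) and treats overlapping intervals as a failure to reject equal strength; the text does not state the exact procedure, so both choices are assumptions.

```python
from itertools import combinations

Z_95 = 1.96  # normal quantile for a two-sided 95% interval

# Standardized coefficients and standard errors reported in the text.
estimates = {
    "effortful control": (0.33, 0.09),
    "parental education": (0.29, 0.04),
    "high school CGPA": (0.26, 0.08),
}

def ci95(beta, se):
    """Normal-theory 95% confidence interval for one coefficient."""
    return (beta - Z_95 * se, beta + Z_95 * se)

def overlap(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

intervals = {name: ci95(beta, se) for name, (beta, se) in estimates.items()}
all_overlap = all(overlap(intervals[x], intervals[y])
                  for x, y in combinations(intervals, 2))
```

Under these assumptions the intervals are roughly [.15, .51], [.21, .37], and [.10, .42]. Every pair overlaps, which is consistent with the conclusion that the three predictors could not be distinguished in strength.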

Tests of indirect effects were performed using confidence inter- 
vals based on the bias-corrected bootstrap method (MacKinnon, 
Lockwood, & Williams, 2004) to verify whether the late adoles- 
cence predictors—effortful control and academic achievement— 
could explain the relation between early adolescence predictors 
(family involvement and problem behavior) and educational 
attainment. Results revealed that effortful control was not a significant 
mediator for any of the early adolescence predictors. CGPA was 
a marginally significant mediator of the relationship between early 
adolescence family involvement and educational attainment, with 
a 90% confidence interval for the β value ranging from .003 to 
.076 (point estimate = .032). Furthermore, CGPA was a significant 
mediator of the relation between early adolescence problem 
behavior and educational attainment, with a 99% confidence 
interval for the β value ranging from −.131 to −.005 (point 
estimate = −.068). 
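
The logic of a bias-corrected bootstrap interval for an indirect effect can be sketched on simulated data. This is not the Mplus analysis reported here: the sketch uses a single observed predictor, mediator, and outcome with ordinary least squares, and the data, effect sizes, and function names are all hypothetical.

```python
import random
from statistics import NormalDist

def cov(u, v):
    """Sample covariance of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

def indirect_effect(x, m, y):
    """a*b, where a is the slope of m on x and b is the partial slope
    of y on m with x controlled."""
    sxx, smm = cov(x, x), cov(m, m)
    sxm, sxy, smy = cov(x, m), cov(x, y), cov(m, y)
    a = sxm / sxx
    b = (smy * sxx - sxm * sxy) / (sxx * smm - sxm ** 2)
    return a * b

def bc_bootstrap_ci(x, m, y, level=0.95, draws=1000, rng=None):
    """Point estimate and bias-corrected bootstrap confidence limits."""
    if rng is None:
        rng = random.Random(0)
    n = len(x)
    point = indirect_effect(x, m, y)
    boots = []
    for _ in range(draws):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases
        boots.append(indirect_effect([x[i] for i in idx],
                                     [m[i] for i in idx],
                                     [y[i] for i in idx]))
    boots.sort()
    nd = NormalDist()
    # Bias correction: how far the bootstrap distribution sits from the
    # point estimate, expressed as a normal quantile.
    prop = sum(b < point for b in boots) / draws
    prop = min(max(prop, 1 / (draws + 1)), draws / (draws + 1))
    z0 = nd.inv_cdf(prop)
    z = nd.inv_cdf(1 - (1 - level) / 2)
    lo_p, hi_p = nd.cdf(2 * z0 - z), nd.cdf(2 * z0 + z)
    lower = boots[min(draws - 1, max(0, int(lo_p * draws)))]
    upper = boots[min(draws - 1, max(0, int(hi_p * draws)))]
    return point, lower, upper

# Simulated data with a true indirect effect of 0.6 * 0.6 = 0.36.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(300)]
m = [0.6 * xi + rng.gauss(0, 1) for xi in x]
y = [0.6 * mi + 0.1 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]
point, lower, upper = bc_bootstrap_ci(x, m, y, rng=rng)
```

An indirect effect is judged significant at the chosen level when the interval excludes zero. Passing `level=0.90` or `level=0.99` yields the narrower or wider intervals of the kind reported above.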

Group invariance tests were conducted to determine whether 
differences in model fit were evident across groups, which would 
suggest moderation effects based on gender or ethnicity. Tests for 
group differences in model fit revealed no significant differences 
between constrained and unconstrained models for gender 
(ΔCFI = .002). The pattern of results obtained from a pooled 
within-group covariance matrix was identical to the one presented 
in Figure 2. In line with preliminary analyses, multiple-group 
analyses comparing ethnic groups (Caucasian vs. African Ameri- 
can) revealed that constraints imposed on mean levels of several 
variables had to be released. These included family SES, teacher 
rating of self-regulation, and parental monitoring. The constraint 
on the residual (unexplained) variance for educational attainment 
was also relaxed. This new model did not differ significantly from 
the unconstrained model. 


Discussion 


The main objective of this study was to test whether adoles- 
cents’ effortful control is a significant predictor of their educa- 
tional attainment in early adulthood, above and beyond established 
academic, familial, behavioral, and demographic factors. The sig- 
nificant relationship between effortful control and educational 
attainment supported our hypothesis, and follow-up analyses re- 
vealed that the final model applied to both genders and was 
generalizable across European American and African American 
participants. 


Effortful Control as a Predictor of 
Educational Attainment 


Effortful control is defined as a temperament-based individual 
characteristic that reflects self-regulatory skill, manifested by the 
ability to inhibit impulses and disruptive behaviors (inhibitory 
control), to focus and maintain attention in spite of distractions 
(attention control), and to initiate and complete tasks that have 
long-term value (activation control; Rothbart, Ellis, & Posner, 
2011). In this study, we tested whether effortful control was related 
to educational attainment after accounting for other well- 
documented predictors. After controlling for other factors, effort- 
ful control was directly associated with educational attainment. 
Moreover, our findings indicate that effortful control is as impor- 
tant as parental education and high school academic achievement 
for predicting educational attainment in early adulthood. 

Several mechanisms could explain the relationship between 
effortful control and educational attainment and should be inves- 
tigated in future studies. One possibility is that as students progress 
through the late high school years and postsecondary education, 
they must increasingly rely on their own volitional resources as 
parents and teachers step out of their supervisory responsibilities to 
encourage students’ autonomous academic development. Ade- 
quate levels of effortful control may support the planfulness and 
self-management needed to successfully complete a postsecondary 
degree. In addition to increased demands on students’ autonomy 
and planning skills, the changing nature and context of the schoolwork 
required of them can also represent a significant change in 
their academic life. Being able to adapt their work habits accord- 
ingly (e.g., creating study groups; starting to work on assignments 
many weeks before the deadline) and to maintain these new 
behaviors over the long term instead of persisting with or going 
back to old habits that may not be adaptive in this new context 
could be one way in which effortful control influences academic 
success and persistence. 

Our findings are consistent with those presented in past studies 
that have explored other constructs related to effortful control as 
predictors of educational and professional success in adulthood 
(Andersson & Bergman, 2011; Wolfe & Johnson, 1995). This 
study builds on existing literature that underscores the importance 
of parental education and youth academic success as key predic- 
tors of educational attainment. Beyond parental support and aca- 
demic ability, adolescents’ self-regulatory capacity inherent in 
effortful control makes a compelling argument for the importance 
of targeting effortful control in efforts to promote school persis- 
tence. 

Our study findings also are consistent with those from past 
studies that have identified processes that can promote effortful 
control functioning (e.g., Fosco, Frank, Stormshak, & Dishion, 
2013; Muraven, 2010; Stormshak et al., 2010). Although our study 
was not designed to test for predictors of effortful control, we did 
identify direct links between positive family involvement and 
problem behavior during early adolescence and later effortful 
control; however, we were unable to find significant indirect 
effects involving adolescent effortful control as a mediator of 
positive family involvement in the prediction of educational at- 
tainment. Nevertheless, the role of parenting in promoting effortful 
control is supported by other research, including a study by Bow- 
ers et al. (2011), showing that aspects of self-regulation closely 
related to effortful control tend to decrease during adolescence but 
can increase under conditions of good parental practices, as do 
GPA and school attendance. 


Early Adolescence Predictors 


Previous work that had investigated the contribution of early 
adolescence predictors of educational attainment prompted us to 
expect a negative relationship between problem behavior and 
future levels of educational attainment, and a positive relationship 
between positive family involvement—including the quality of 
relationships, parental monitoring, and rules about doing home- 
work—and educational attainment. However, in our model, the 
direct paths between these factors from early adolescence and 
educational attainment were not significant. Instead, our findings 
suggest that these relationships are mediated by more proximal 
factors, such as academic achievement, which was a moderately 
strong predictor of educational attainment. The indirect effects of 
problem behavior and of family involvement on educational at- 
tainment support the idea that these early predictors do matter and 
deserve attention from researchers and practitioners who seek to 
promote educational attainment in youths beginning at an early 
age. 

Regarding the family-related predictors, it had already been 
established that warm but structuring parenting can facilitate aca- 
demic achievement and discourage adolescents’ substance use 
(Coombs & Landsverk, 1988; Leung, Lau, & Lam, 1998; Steinberg,
Lamborn, Dornbusch, & Darling, 1992). Of particular interest
to us, though, was the possibility that family involvement could
help promote greater effortful control during adolescence, as sug- 
gested by recent studies (Bowers et al., 2011; Doan, Fuller-Rowell, 
& Evans, 2012). Without any earlier measurement of effortful 
control in the sample, it was not possible to verify whether this 
relationship simply reflected the enduring consequences of parent-child
dynamics promoting effortful control early in childhood.
Nevertheless, Stormshak, Fosco, and Dishion (2010) found that a 
parent-focused intervention was related to an improvement in their 
children’s effortful control over time, which supports the view that 
parents can actively help their child develop higher levels of 
effortful control in middle school. Effortful control was, in turn, 
predictive of an increase in academic engagement in high school. 
This is a new and promising avenue for applied research in the 
domain of academic persistence. In fact, the role of family rela- 
tionships may be particularly consequential, considering a study by 
Belsky and Beaver (2011) that suggested that genetic predisposi- 
tions can make male adolescents particularly vulnerable to deficits 
in self-regulation when they are exposed to poor parenting prac- 
tices. 

The association between problem behavior in childhood and 
academic outcomes in adolescence has already been documented 
(Véronneau, Vitaro, Pedersen, & Tremblay, 2008), but the path 
from problem behavior to effortful control is of greater interest in 
this study. The absence of repeated measures of problem behavior 
and effortful control makes it difficult to settle with confidence on 
a specific direction of a possible causal effect. Numerous studies 
have linked lower levels of effortful control (or related self- 
regulatory skill) in early childhood to later development of prob- 
lem behaviors (e.g., Eiden, Edwards, & Leonard, 2007; King, 
Lengua, & Monahan, 2013; Lengua, 2006; Robins, John, Caspi, 
Moffitt, & Stouthamer-Loeber, 1996). However, our study also 
supports the possibility that young adolescents who engage in 
antisocial activities may, as a result, be diverted from opportunities 
to practice and reinforce their ability to exert effortful control—for 
example, by being suspended from school. 


Late Adolescence Predictors 


Previous research that had suggested that substance use in later 
adolescence and academic achievement in high school could help 
predict which students would reach higher levels of educational 
attainment motivated us to include these two predictors as concur- 
rent control variables when testing for effortful control as a pre- 
dictor of educational outcomes. 

Although the nonsignificant role of substance use in this study 
contrasts with results from other studies that revealed a significant 
role with a similar outcome, several explanations for the discrepant 
results are possible. For example, Hardy et al. (1997) found that 
smoking cigarettes in adolescence is related to lower levels of 
education in adulthood, but their educational outcome distin- 
guished only between students who obtained a high school diplo- 
ma/graduate equivalency degree and those who did not. Further- 
more, their study assessed inner-city children who had been born 
in the 1960s, in contrast with our participants who were from a 
wider range of demographic backgrounds and who had been born 
in the 1980s. Cohort effects or differences in demographic backgrounds
could explain the divergence of results between their
study and ours. Ryan (2010) found a detrimental contribution of
marijuana use, but again the sample of participants had been born 
much earlier (the late 1950s to early 1960s), and the control 
variables used in that study focused more heavily on sociodemographic
characteristics than on family dynamics and students’
academic achievement. A study by King et al. (2006) in which a 
more comparable set of control variables was used revealed, by 
using growth modeling techniques, a significant contribution of 
drug use (but not alcohol use) to the likelihood of attending 
college. This finding suggests that research focused on the specific 
contribution of substance use to educational attainment would 
benefit from sophisticated longitudinal modeling of such variables. 
Because effortful control has already been shown to reduce the risk 
of increases in tobacco and marijuana use over time (Piehler, 
Véronneau, & Dishion, 2012), it is interesting to note that the 
association between effortful control and educational attainment in 
our study was completely independent from substance use. In other 
words, it is unlikely that the link between effortful control and 
higher educational attainment can be explained merely by the 
capacity to refrain from using substances. 

Consistent with past research, our study revealed that academic 
competence in high school as measured from school records of 
academic achievement (CGPA) was a significant predictor of 
educational attainment, and its influence was moderate in size, just 
like that of effortful control and of parental education. It should be 
noted that CGPA was highly correlated with effortful control (r = 
.71 for the estimated bivariate correlation, and r = .67 for the 
model estimated residual correlation). The strong correlation be- 
tween academic achievement and effortful control is consistent 
with results from theoretical work and empirical work linking 
effortful control to academic performance (e.g., Allan & Lonigan, 
2011; Checa et al., 2008; Posner & Rothbart, 2007; Valiente et al., 
2010). Longitudinal studies with repeated measurements of both 
effortful control and academic achievement would help confirm 
the sequence of action of effortful control and academic achieve- 
ment that predict educational attainment. 


Sociodemographic Factors 


In line with past research, students’ sociodemographic back- 
ground played an important role in the prediction of educational 
attainment. We expected that families with higher income and 
more stable living conditions (e.g., owning or renting a house 
rather than living in a precarious housing situation) are in a better 
position to support their child through their high school studies and 
provide financial resources that facilitate access to higher educa- 
tion. Although a moderate correlation emerged between family 
SES and participants’ educational attainment in preliminary bivari- 
ate analyses, this association was not significant in the overall 
model, when controlling for other predictors. In contrast, parent 
education was linked to children’s educational attainment in the 
overall model, independent of family financial resources. This 
finding provided support for our decision to examine the influence 
of parent education and that of other SES indicators separately. 
Our study cannot speak to the mechanisms linking parent educa- 
tion to child educational attainment in this sample, but numerous 
plausible explanations have been identified by other studies, in- 
cluding parental involvement, parents’ ability to understand and 
navigate the school system, parental expectations, and family 
attitudes toward schooling (e.g., Martin, 2012; Pettit et al., 2009).
Given that effortful control is partly heritable, the link between 
parents’ and children’s educational attainment might also reflect 
genetic predispositions for self-regulatory skills that support 
school success and persistence. 


Strengths and Limitations 


This study possesses many strengths. First, our main predictor, 
effortful control, was based on a latent variable that included 
parent, teacher, and self-reports. Also, we were able to control for 
most of the established predictors of educational attainment, which 
strengthens our conclusions about the significant role of effortful 
control in predicting our outcome of interest. In addition, the 
longitudinal design made it possible to use predictors at important 
times of development from early to late adolescence and to assess 
educational attainment in early adulthood, when a good level of 
variance has emerged in this variable. The large number of par- 
ticipants helped us identify small effects and compare results 
across subgroups of participants (gender, ethnicity). It is notewor- 
thy that the relationships between the many predictors in this 
model and educational outcomes were consistent across genders 
and ethnic groups (European American vs. African American). 
This suggests that concrete interventions based on the results from 
this study are likely to be relevant for most students. Further 
research that includes a larger number of students belonging to the 
smaller ethnic groups is needed, however, to verify whether our 
results generalize to them. 

Some limitations in this study would be useful to consider in 
future work. Having access to earlier measurements of effortful 
control would have been very helpful to test its contribution to 
educational attainment from a process standpoint. For example, 
effortful control at an early age could affect educational attainment 
through its influence on academic achievement, family relation- 
ships, or other mediators. Repeated measures of effortful control 
could even help determine whether it can be increased through 
environmental influences or intervention programs. To help ex- 
plore those possibilities, a more recent study by our research group 
(Project Alliance 2; Stormshak et al., 2010) included several 
measures of effortful control completed during the adolescent 
years. Another limitation is that this study had no measure that 
allowed us to control for students’ educational aspiration, which 
has been shown in past research to be a significant predictor of 
educational attainment (e.g., Dubow et al., 2009; Marjoribanks, 
2005; South et al., 2003). In addition, the longitudinal nature of the 
study led to some missing-data issues. In general, missing data was 
more common among males and among lower functioning adoles- 
cents (lower CGPA, lower effortful control, more substance use) 
and parents (lower SES, lower parental education, less parental 
monitoring). These patterns might limit the generalizability of our 
results and suggest that lower functioning participants might have 
had more difficulty responding to the questionnaires, possibly 
because of lower reading abilities or because of additional stressful 
life events that may leave them less time or less availability to 
answer a questionnaire. Still, by using FIML to manage missing 
values, we are confident that our results are less biased than those 
we would have obtained using other popular strategies (e.g., list- 
wise deletion, mean substitution, single imputation; Widaman, 
2006). 


Conclusion 


In this study, we showed that effortful control in late adoles- 
cence is a significant predictor of educational attainment by age 
23, and its associated effect size was comparable to those of high 
school CGPA and parental education. This finding indicates the 
importance of self-regulatory skills for success in postsecondary 
education and suggests that efforts to improve educational attain- 
ment may be enhanced by programs that promote the development 
of self-regulatory skills. To date, research examining the mallea- 
bility of effortful control through socialization and through expo- 
sure to cognitively and emotionally challenging tasks has shown 
encouraging results in children and adolescents. Dropout preven- 
tion programs could include an effortful control reinforcement 
component that begins early on and continues throughout the high 
school years as a way to further support the pursuit and completion 
of higher education. Substantive and lasting improvement in the 
level of educational attainment in the population is likely to require 
a combination of strategies that targets not only individual students 
but also their environment, including family members, schools, 
community institutions, and governing bodies at the local and 
national levels. Programs that support the development of self- 
regulation may prove to be an important part of these efforts. 


Self-handicapping represents a frequently used strategy for regulating the threat to self-esteem elicited by
the fear of failing in academic achievement settings. Several studies have documented negative associ- 
ations between self-handicapping and different educational outcomes, inter alia academic achievement. 
However, studies on the relation between self-handicapping and academic achievement have yielded 
heterogeneous results, indicating the need to conduct meta-analytic investigations and to examine the 
relevance of several potential moderator variables. This meta-analysis integrates the results of 36 field 
studies with 49 independent effect sizes (N = 25,550). A random effects model revealed a mean effect
size between self-handicapping and academic achievement of r = -.23 (p < .001, range: r = -.46 to
r = .02). Moreover, moderator analyses showed that the type of self-handicapping scale, the school type 
(elementary, middle, high school, university), the level of mastery goals in the sample, and the reliability 
of the self-handicapping scale considerably influenced the mean correlation. Based on our findings, we 
conclude that educational interventions to enhance academic achievement should additionally focus on
preventing self-handicapping.


Keywords: self-handicapping, academic achievement, meta-analysis 


In the context of academic learning, students sometimes expe- 
rience threats to their self-esteem. These threats are often elicited 
by the fear of failing in upcoming achievement situations such as 
an important exam. A common strategy for regulating this kind of 
self-esteem threat is self-handicapping, which has been defined as 
constructing impediments to performance to protect or enhance 
one’s perceived competence (Berglas & Jones, 1978). Examples of 
academic self-handicapping include procrastinating, effort with- 
drawal, and claiming test anxiety or illness (Urdan & Midgley, 
2001). There is substantial agreement in the literature that aca- 
demic self-handicapping has negative effects on important educa- 
tional processes and outcomes such as motivation and achievement 
(e.g., Martin, Marsh, & Debus, 2001a; Urdan, Midgley, & Ander- 
man, 1998; Zuckerman, Kieffer, & Knee, 1998). However, find- 
ings from field studies on the relation between self-handicapping 
and achievement have reflected considerable heterogeneity, rang- 
ing from nonsignificant (Rhodewalt & Hill, 1995), to moderately 
negative (Boon, 2007; Schwinger & Stiensmeier-Pelster, 2012), to
large negative correlations (Midgley & Urdan, 1995, 2001). The 
variability in results has prohibited researchers from drawing a 
general conclusion concerning the mean effect of self-handi- 
capping on achievement; and this, in turn, makes it difficult to 
estimate the implications of self-handicapping for educational 
practice. Moreover, these inconsistencies in the reported find- 
ings suggest that the negative consequences of self-handicapping 
could be more or less pronounced under different circumstances. 
Leondari and Gonida (2007), for instance, reported that self- 
handicapping and achievement were more closely related in ele- 
mentary compared to high-school students. Another potential mod- 
erator might be the self-handicapping scale that is used. 
Questionnaires measuring habitual self-handicapping focus on dif- 
ferent constituent elements of self-handicapping (e.g., using the 
handicap as an excuse, the a priori timing of the strategy). As these 
elements themselves might be differently related to academic 
achievement, the choice of a self-handicapping scale might already 
predispose studies to reach dissimilar conclusions about the rela- 
tion of self-handicapping to achievement (Urdan & Midgley, 
2001). 

Several reviews on self-handicapping have been published, each 
of them addressing a special feature of this widespread phenomenon,
such as the predictive roles of gender (Hirt & McCrea, 2009)
or achievement goals (Urdan & Midgley, 2001). Surprisingly, 
however, there has been no meta-analytic investigation to date of 
the relation between self-handicapping and achievement in the 
academic domain. In this article, we present the first meta-analysis 
on this topic as we seek (a) to provide an empirically sustained 
estimation of the mean correlation between self-handicapping and 
achievement in academic settings outside the laboratory and (b) to
investigate the impact of several presumed moderators of this 
relation. 


The Nature of Self-Handicapping 


According to Berglas and Jones (1978), self-handicapping is 
defined as “any action or choice of performance setting that 
enhances the opportunity to externalize (or excuse) failure and 
to internalize . . . success” (p. 406). The impetus for self- 
handicapping is uncertainty about one’s ability, including antici- 
pated threats to one’s self-esteem (Berglas & Jones, 1978; Snyder 
& Smith, 1982). The postulated protective function of self- 
handicapping on self-esteem takes advantage of the discounting 
and augmentation principles of attribution (Kelley, 1971). In the 
event of failure, the presence of an impediment offers individuals 
the opportunity to shift attributions for poor performance from low 
ability (e.g., “I failed the exam because I’m stupid”) to the handicap
(e.g., “I failed the exam because I didn’t sleep well last
night”). By doing this, ability will be discounted as a causal 
attribution, and one’s image of competence as well as one’s 
self-esteem will be buffered. If the individual surprisingly suc- 
ceeds, attributions to high ability will be augmented because the 
individual performed well despite the handicap (Tice, 1991). 

An important distinction in the literature has been drawn be- 
tween behavioral and claimed self-handicapping (Arkin & Baum- 
gardner, 1985; Leary & Shepperd, 1986). Behavioral self- 
handicapping implies an active acquisition of an impediment, such 
as drug abuse (Berglas & Jones, 1978), decreased practice time 
(Baumeister, Hamilton, & Tice, 1985), or choice of debilitating 
performance settings (Rhodewalt & Davison, 1986). By contrast, 
claimed self-handicappers only report the presence of obstacles. 
For example, they claim to suffer from test anxiety (Smith, Snyder, 
& Handelsman, 1982), physical symptoms (Smith, Snyder, & 
Perkins, 1983), or a bad mood (Baumgardner, Lake, & Arkin, 
1985). These two self-handicapping modes differ from one another 
in terms of cost-benefit analyses (Hirt, Deppe, & Gordon, 1991). 
On the one hand, behavioral handicaps are more credible because 
they are more convincingly tied to performance than claimed ones. 
For the same reason, however, behavioral handicaps are more 
costly. On the other hand, claimed handicaps, such as reports of 
test anxiety, also serve as an excuse for failure but do not neces- 
sarily decrease one’s chances of being successful as behavioral 
handicaps do (Hirt et al., 1991; Leary & Shepperd, 1986; Zuck- 
erman & Tsai, 2005). The conceptually meaningful distinction 
between claimed and behavioral self-handicapping notwithstand- 
ing, the majority of questionnaires designed for assessing habitual 
self-handicapping in the academic field clearly emphasize behav- 
ioral forms of handicapping and do not distinguish between 
claimed and behavioral self-handicapping. 


The Relation Between Academic Self-Handicapping 
and Achievement 


A theoretical framework for the relation between academic 
self-handicapping and achievement is provided by the Self- 
Handicapping and Self-Regulation Cycle (Rhodewalt & Tragakis, 
2002; Rhodewalt & Vohs, 2005). In this model, distal motives 
such as uncertain self-conceptions of competence or low self-esteem
lead to decreased performance expectancies for upcoming
tests. The lowered expectancies then serve as proximal motives to 
use self-handicapping for self-protection. Rhodewalt and Tragakis 
(2002) assumed that self-handicappers are primarily concerned 
about their self-worth and less about their actual performance. This 
skewed focus leads people to choose handicaps that—although 
effective in protecting their self-esteem—are really detrimental to 
their performance, such as procrastinating or drinking before an 
exam. The impaired performance has recursive effects on one’s 
self-perceptions of ability, thus reinitializing a new cycle of a 
threatened self-image, self-protection through self-handicapping, 
and lowered performance¹ (Zuckerman et al., 1998). Although the
authors acknowledged that the degree to which performance is 
impaired may depend on the kind of handicap, they suggested that 
chronic self-handicapping has long-term negative effects on 
achievement. 

Many studies on self-handicapping have been conducted in
experimental laboratory settings. Although this kind of research 
has provided valuable insights into the structure and mode of 
action of self-handicapping, findings from laboratory studies can- 
not be generalized to real-world classroom settings in school or 
college. Thus, laboratory studies do not contribute to the primary 
purpose of our meta-analysis, which is to unravel the mean asso- 
ciation between self-handicapping and achievement in academic 
settings in order to estimate the severity of the problem for every- 
day educational practice. Consequently, to be included in the 
following review, studies (a) must have been conducted in a field 
setting rather than in a laboratory setting, (b) must have included 
a sample comprised of school (i.e., elementary, middle, or high 
school) and/or college or university students, and (c) must have 
reported a correlation between a self-handicapping questionnaire 
and a measure of academic achievement (e.g., grade point average 
[GPA], test scores). 
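
The three inclusion criteria above can be read as a simple screening rule. The following is only a minimal sketch of that rule: the `Study` record, its field names, and the example entries are hypothetical and are not taken from the coding scheme actually used in the meta-analysis.

```python
from dataclasses import dataclass
from typing import Optional

# Sample types named in criterion (b); the exact labels are an assumption.
ELIGIBLE_SAMPLES = {"elementary", "middle", "high school", "college", "university"}


@dataclass
class Study:
    """Hypothetical coding record for one candidate study."""
    label: str
    setting: str          # "field" or "laboratory"
    sample: str           # type of student sample
    r: Optional[float]    # reported self-handicapping/achievement correlation, if any


def meets_inclusion_criteria(study: Study) -> bool:
    """Apply criteria (a) field setting, (b) student sample, (c) reported correlation."""
    in_field = study.setting == "field"
    student_sample = study.sample in ELIGIBLE_SAMPLES
    reports_correlation = study.r is not None
    return in_field and student_sample and reports_correlation


# Made-up candidates, used only to illustrate the screening step.
candidates = [
    Study("hypothetical field study", "field", "middle", -0.20),
    Study("hypothetical lab study", "laboratory", "university", -0.30),
    Study("hypothetical study without r", "field", "high school", None),
]
included = [s.label for s in candidates if meets_inclusion_criteria(s)]
```

Only the first made-up candidate passes all three checks; the second fails criterion (a) and the third fails criterion (c).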

Empirical studies conducted in the academic field with school or 
university students have reported quite inconsistent results. To 
categorize the respective effect sizes, we used the reference points 
for evaluating influences on academic achievement provided by 
Hattie (2009), who considered r = .10 to be small, r = .20 to be 
moderate, and r = .30 to be large effects. According to these 
guidelines, some studies have revealed only a small relation between
self-handicapping and achievement of r = -.08 (Urdan,
2004) or r = -.07 (Wesley, 1994). In most cases, however, the
two constructs were correlated to a moderately negative degree. In
a sample of undergraduate psychology students, Elliot and Church
(2003) found self-handicapping to be negatively associated with
exam performance (r = -.15). Further examples of moderate
effect sizes were obtained by Martin, Marsh, and Debus (2001b;
r = -.19); Zuckerman et al. (1998; r = -.20); and McCrea and
Hirt (2001; r = -.23). Contrary to these results, other authors have
reported considerably larger correlations of, for instance, r = -.40
(Midgley & Urdan, 2001), r = -.38 (Gadbois & Sturgeon, 2011),
r = -.33 (Shih, 2005), and r = -.38 (Midgley & Urdan, 1995).
In sum, findings have been characterized by substantial heterogeneity
with an apparent overlap of moderately negative effect sizes.
At first glance, there are several explanations for these heterogeneous
correlations.


¹ The conceptual focus of this article is on the directional effects from
self-handicapping to achievement. Given the reciprocal character of the
relation between self-handicapping and achievement, we should have included
only longitudinal studies in our meta-analysis in which earlier
measures of self-handicapping determined later assessments of performance.
Due to the small number of longitudinal studies, however, we
decided to base the present meta-analysis on both longitudinal and
cross-sectional studies. Throughout the article, we thereby seek to maintain the
theoretical perspective of directional effects of self-handicapping on
achievement, but we are also cautious about our use of causal wording as
we acknowledge the correlational design of most studies reviewed here.
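
The way the reported correlations are sorted into small, moderate, and large can be sketched in a few lines. The reference points of .10, .20, and .30 come from the text. The cutoffs between categories (.15 and .25, the midpoints between adjacent reference points) are an assumption made for this illustration, chosen because they reproduce the labels used above; they are not stated by Hattie (2009) or by the authors.

```python
# Assumed cutoffs: midpoints between the reference points .10, .20, and .30.
MODERATE_CUTOFF = 0.15
LARGE_CUTOFF = 0.25


def classify_effect(r: float) -> str:
    """Label a correlation by its absolute size under the assumed cutoffs."""
    size = abs(r)
    if size < MODERATE_CUTOFF:
        return "small"
    if size < LARGE_CUTOFF:
        return "moderate"
    return "large"


# Correlations quoted in the paragraph above, keyed by their citation.
reported = {
    "Urdan (2004)": -0.08,
    "Elliot & Church (2003)": -0.15,
    "McCrea & Hirt (2001)": -0.23,
    "Midgley & Urdan (2001)": -0.40,
}
labels = {study: classify_effect(r) for study, r in reported.items()}
```

Under these assumed cutoffs the four examples are labeled small, moderate, moderate, and large, which matches how the text describes them.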


Moderator Variables 


The studies that have been conducted on the relation between 
self-handicapping and academic achievement have differed in a 
number of parameters such as the questionnaires used for assessing 
self-handicapping, participants’ age, the indicators for achieve- 
ment, and so forth. We now elaborate on how differences in some 
of these aspects may produce theoretically plausible effects on the 
magnitude of the correlation between self-handicapping and 
achievement. 


Self-Handicapping Questionnaire 


One of the most apparent differences among studies examining 
the relation between self-handicapping and achievement is in the 
questionnaires that have been used to assess the participants’ 
self-handicapping. With some exceptions, researchers have mainly 
relied on either the Academic Self-Handicapping Scale (ASHS; 
Midgley & Urdan, 1995; Urdan et al., 1998) or the Self- 
Handicapping Scale (SHS; Jones & Rhodewalt, 1982). Whereas 
the six-item ASHS has been used in essentially the same form over 
time (Urdan & Midgley, 2001), the SHS has been used both in its 
original form with 25 items and in short versions with 10 items 
(Strube, 1986) and 14 items (Rhodewalt, 1990; Zuckerman et al., 
1998), respectively. Despite some overlap, the ASHS and the SHS 
show considerable differences in their operationalization of self- 
handicapping. In our view, the ASHS items were developed in a 
straightforward manner in conjunction with existing theory. Urdan 
and Midgley (2001) denoted three features necessary for a valid 
self-handicapping item. It has to include the handicapping behavior
(e.g., effort withdrawal), the reason for this behavior (e.g., to
use low effort as an excuse), and the a priori timing of the strategy 
(e.g., low effort as an excuse is installed before failure occurs). All 
ASHS items have been formulated in line with these recommen- 
dations (e.g., “Some students put off doing their school work until 
the last minute so that if they don’t do well on their work, they can 
say that is the reason. How true is this of you?”). By contrast, items 
from Jones and Rhodewalt’s (1982) SHS do not fully represent 
these criteria. Many items just ask for a behavior that has the 
potential to be a handicapping behavior, thereby leaving the reason 
for it open (e.g., “I tend to put things off until the last moment”; “I 
am easily distracted by noises or my own daydreaming when I try 
to read”). Likewise, another set of SHS items emphasizes a person’s
tendency to search for excuses in the case of failure. However, the 
a priori timing of installing the excuse before failure occurs is not 
integrated into these items (e.g., “I tend to make excuses when I do 
something wrong”). Altogether, the SHS items are only partially in 
line with Urdan and Midgley’s (2001) required features of a valid 
self-handicapping item, and the criteria that they fulfill are not 
consistent across all SHS items. 


In light of the differences described above, we believe that 
choosing either the ASHS or the SHS can be responsible for 
dissimilar correlations between self-handicapping and achieve- 
ment. So which kind of relation would be expected for which kind 
of questionnaire? We assume that agreeing with statements from 
the ASHS will have more deleterious effects on students’ perfor- 
mance than agreeing with items from the SHS. This seems rea- 
sonable to assume because the SHS does not necessarily assess 
individuals’ tendencies to self-handicap but rather measures some 
kind of undifferentiated avoidance behavior. High scores on the 
SHS might thus be justified by a number of reasons that may not 
have been considered when the items were formulated. Putting 
things off until the last moment, for instance, might reflect a 
stress-induced kind of time management that does not necessarily 
lead to lower achievement (Steel, 2007). Likewise, tending to use 
excuses after failure may be just an adaptive self-protective reac- 
tion that says nothing about one’s tendency to establish excuses 
before an important test or exam. 

Taken together, we think that the maladaptive effects of self- 
handicapping on achievement accumulate with each step of the 
self-handicapping process. Whereas executing a behavior that has 
the potential to handicap one’s test score may already be detri- 
mental, it is even worse when it is used as an excuse as well as 
when it is implemented right before the respective test situation. A 
person’s behavior must agree with all three criteria to paint the 
picture of a self-handicapping person who feels threatened by 
anticipated failure; who invests more time in ruminating about 
him- or herself than in learning; and who is prone to entering a 
vicious cycle of low performance, increased self-esteem threat, 
and repeated self-handicapping. However, this form of academic
self-handicapping is exclusively represented by the ASHS,
so it would be reasonable to assume that higher negative correla- 
tions between self-handicapping and achievement would be found 
in studies using this measure compared to the SHS. 

In addition to the ASHS and SHS, the Self-Sabotage subscale of 
the Motivation and Engagement Scale (MES; see Liem & Martin, 
2012, for an overview) has also been used in some studies. The 
MES was designed to represent 11 cognitive and behavioral di- 
mensions relevant to motivation and engagement as proposed 
by Martin (2007) in his Motivation and Engagement Wheel. In the 
MES, self-sabotage (in fact, often termed “self-handicapping” in 
more recent work) is measured with four items that were adapted 
from either the ASHS or the short version of the SHS (Strube, 
1986). Nevertheless, all items fulfill the requirements for a
valid self-handicapping item outlined by Urdan and Midgley 
(2001); thus, we assumed that the ASHS and the self-sabotage 
scale would yield very similar correlations with academic achieve- 
ment. 


School Type and Age 


In order to provide teachers and practitioners with information 
about the possible starting points of such vicious cycles, it is 
crucial to investigate potential age-related differences in the effects 
of self-handicapping on achievement. However, this topic has 
seldom been addressed in the literature. To our knowledge, Le- 
ondari and Gonida’s (2007) study is the only one that has provided 
a direct comparison between students of different ages. They found 
a moderately negative correlation between self-handicapping and 
mathematics achievement in elementary and junior high school
students, whereas there was a null relation in senior high school 
students. Descriptive inspections of other studies have supported 
these results. At first glance, studies examining samples of college 
or university students seem to have reported lower correlations 
(e.g., Martin et al., 2001b) compared to studies investigating 
samples of younger school students (e.g., Midgley & Urdan, 
2001). This overview of empirical evidence suggests that a meta- 
analytic investigation could also reveal meaningful age-related 
differences. 

These empirical findings notwithstanding, theoretical justifica- 
tions for age differences have been scarce in the self-handicapping 
literature. Zuckerman et al. (1998) even hypothesized that chronic 
self-handicappers should show an accelerated decline in academic 
achievement as the vicious circle they engage in results in changes 
in academic achievement for the worse. However, the results 
depicted above contradict this hypothesis. One possible explana- 
tion for the findings reported above can be found in the different 
grading structures that are used in elementary and middle school. 
In the grading systems used by teachers of younger students, more 
emphasis is placed on soft factors such as motivation to learn, 
classroom behavior, and so forth than in the systems used by 
teachers of older students (McMillan, 2001; Remesal, 2011). Thus, 
effort withdrawal or other kinds of handicapping behaviors could 
contribute a particularly large amount of negative weight toward 
the grades of younger students. This could lead to an inflated 
correlation between self-handicapping and achievement in 
younger students. 

Another possible explanation might come from developmental 
differences in students’ self-evaluations of their ability. Young 
children do not have a clearly differentiated definition of academic 
competence (Marsh, 1992; Stipek & Mac Iver, 1989); thus, they 
might interpret failure in a specific domain to mean that they are 
generally less capable in school. Such generalization processes 
could induce a self-reinforcing cycle of lowered success expectan- 
cies and self-handicapping in all domains, which, in turn, might 
accumulate into more deleterious effects on performance com- 
pared to students with more differentiated self-perceptions of their 
ability. Overall, we expected to identify age-related differences in 
the correlation between self-handicapping and achievement in the 
present meta-analysis. Those differences might be explained by 
developmental issues as described above. 


Gender 


In numerous studies, men have been found to show higher 
scores on self-handicapping questionnaires than women (e.g., 
Midgley & Urdan, 1995, 2001; Urdan et al., 1998). These results 
and also the fact that women use behavioral forms of self- 
handicapping less frequently than men represent very robust find- 
ings in self-handicapping research (see Hirt & McCrea, 2009, for 
a review). Whereas all people prefer claimed over behavioral 
self-handicapping, this tendency is quite a bit more observable in 
women than in men (Hirt et al., 1991). Researchers have struggled 
for a long time to explain these mean differences in behavioral 
self-handicapping. To date, the most prominent assumption 
stresses the differential valuing of effort. McCrea, Hirt, Hendrix, 
Milner, and Steele (2008) introduced the Worker scale, an instrument 
that assesses the extent to which an individual sees him/herself as 
a hard worker and personally values these characteristics. The 
authors found that women tend to score higher on this 
measure. Moreover, the Worker scale partially mediated the rela- 
tion between gender and behavioral self-handicapping (i.e., 
women showed higher scores on the Worker scale but lower scores 
on behavioral self-handicapping). 

Only a few studies have reported correlations between self- 
handicapping and achievement separately for women and men. 
Whereas McCrea et al. (2008) reported similar correlations be- 
tween self-handicapping and GPA for women and men, Wesley 
(1994) found a substantially higher correlation for men (r = -.36) 
compared to women (r = -.23). Due to this small database, we 
deemed it important to meta-analytically examine the potential 
moderating effect of gender. From a conceptual standpoint, the 
Worker scale findings by McCrea et al. (2008) may be helpful for 
illuminating the extent to which gender moderates the correlation 
between self-handicapping and achievement. Women seem to be 
smarter when choosing a self-protection strategy. With respect to 
self-handicapping, they tend to rely on claimed rather than behav- 
ioral self-handicapping (Hirt & McCrea, 2009). However, because 
all questionnaires included in this meta-analysis put more empha- 
sis on behavioral forms of self-handicapping, this aspect alone was 
not expected to produce a moderating effect of gender. It was thus 
more important to consider whether women would self-handicap 
in a smarter way than men even when they handicapped behav- 
iorally, too. For example, women may be more sensitive to having 
an adequate degree of effort withdrawal, thereby not reducing their 
effort more than necessary. If true, this would be reflected in a 
substantially lower correlation between self-handicapping and ac- 
ademic achievement for women compared to men. 


Achievement Goals 


Achievement goals refer to the reasons why people engage in 
achievement-related situations. Recent frameworks mainly differ- 
entiate four different achievement goals (Elliot & McGregor, 
2001): mastery-approach goals (enhancing task-based or intraper- 
sonal competence), mastery-avoidance goals (avoiding task-based 
or intrapersonal incompetence), performance-approach goals (demon- 
strating high ability or competence to others), and performance- 
avoidance goals (avoiding appearing incompetent to others). Sev- 
eral studies have found positive links between self-handicapping 
and performance-avoidance goals (e.g., Elliot & Church, 2003; 
Leondari & Gonida, 2007) but negative links between self- 
handicapping and mastery goals (e.g., Martin et al., 2001b; Midg- 
ley & Urdan, 2001). 

However, students can pursue multiple goals simultaneously. 
Schwinger and Stiensmeier-Pelster (2011) reported that the addi- 
tional endorsement of mastery-approach goals buffered the rela- 
tion between performance-avoidance goals and self-handicapping 
most likely because mastery goals decreased the self-esteem- 
threatening effect of anticipated failure by suggesting attributions 
of failure that emphasize controllable factors such as low effort. 
The relation between self-handicapping and academic achieve- 
ment may be influenced in a similar way: The more students also 
pursue mastery goals, the more they might attribute ongoing fail- 
ure to controllable factors that might help them to find a way out 
of the self-handicapping cycle. Based on this reasoning, we assumed 
that correlations between self-handicapping and achievement would 
be lower for highly mastery-oriented students. 


Self-Esteem, Self-Efficacy, and Academic Self-Concept 


Cognitive and affective self-evaluations also serve as determi- 
nants of self-handicapping, whereby positive views of the self are 
generally associated with less of a need to handicap (Schwinger & 
Stiensmeier-Pelster, 2012; Zuckerman et al., 1998). Parallel to the 
reasoning for achievement goals, we assume that positive self- 
views suggest more adaptive attributions of one’s performance, 
and thus, students with positive self-views should subsequently 
self-handicap for a shorter time period and/or less strongly. Con- 
sequently, we proposed that the negative effects of self- 
handicapping on achievement would be lower for students with 
rather positive compared to negative self-perceptions. 


Level of Achievement 


Most handicapping behaviors are supposed to interfere with 
deep and successful learning. It might make a difference, however, 
if the handicapping student has usually performed well or poorly in 
the past. As previous knowledge and performance in a given 
domain are among the most powerful predictors of further achieve- 
ment (Hattie, 2009; Steinmayr & Spinath, 2009), it seems reason- 
able to assume that previously low-achieving students will further 
fall behind after using self-handicapping strategies. By contrast, 
usually high-achieving but occasionally handicapping students 
will be able to draw on their accumulated knowledge, which may 
result in only a marginal drop—if there is any drop at all—in 
school performance. Thus, we deemed it possible that the relation 
between self-handicapping and achievement would be higher for 
low-achieving compared to high-achieving students. 


Origin of the Sample 


As self-handicapping represents a strategy for regulating one’s 
self-esteem, it is important to consider how individual self-esteem 
is psychologically construed. Self-construal is supposed to differ 
as a function of individualism-collectivism. In individualistic cul- 
tures, the self is construed in independent terms as a separate, 
distinct entity, and the main task of the person is to “stand out” by 
distinguishing oneself from others through self-sufficiency and 
personal accomplishment. In collectivistic cultures, the self is 
construed in interdependent terms as a connected relational entity, 
and the main task of the person is to “fit in” by maintaining 
interpersonal relationships and group harmony (Markus & Ki- 
tayama, 1991). In individualistic cultures such as the United States, 
the attainment of positive outcomes is emphasized and valued, 
whereas in collectivistic cultures such as South Korea and Russia, 
avoiding negative outcomes is emphasized and valued (Elliot, 
Chirkov, Kim, & Sheldon, 2001). Drawing on these findings, we 
assume that self-handicapping is seen more positively in collectivistic 
cultures, probably leading to a more benign evaluation 
of self-handicapping persons. More benign performance evalu- 
ations, in turn, might reduce the negative relation between 
self-handicapping and academic achievement. 


Ethnicity 


Several studies have addressed the question of whether students 
belonging to cultural minorities (e.g., African American) might be 
more prone to self-handicapping. Urdan and Midgley (2001) ar- 
gued that stereotype threat among minorities makes self- 
handicapping more likely in these cultural groups. For example, 
when African American students are concerned with appearing 
academically able, the threat of fulfilling a negative stereotype 
about African Americans’ low academic ability is activated and 
there is a greater need to avoid appearing academically unable than 
there is for European American students, who have no such ste- 
reotype. Such processes may result in higher handicapping for 
students from any stereotype-threatened cultural group (Urdan et 
al., 1998). Even more important, however, stereotype attitudes 
may also be activated in the teachers. There is ample evidence that 
such stereotype attitudes bias teachers’ interpretations of both the 
adaptive and maladaptive behaviors of their students (Gunderson, 
Ramirez, Levine, & Beilock, 2012). In this regard, teachers might 
see their stereotype fulfilled in self-handicapping minority students 
and may therefore assign immoderately bad grades to these stu- 
dents. To our knowledge, only a few studies (e.g., Midgley, 
Arunkumar, & Urdan, 1996) have reported correlations between 
self-handicapping and achievement separately for different ethnic 
groups, so we deemed it necessary to investigate this issue meta- 
analytically here. 


Achievement Indicators 


The relation between self-handicapping and achievement may 
depend on the achievement indicators considered. Meta-analyses 
on related constructs have sometimes yielded substantial differ- 
ences for distinct measures of achievement. In his meta-analysis on 
procrastination, Steel (2007) reported a significantly weaker neg- 
ative relation between procrastination and overall GPA compared 
to course-specific GPA. Investigating the meta-analytic effects of 
achievement goals, Wirthwein, Sparfeldt, Pinquart, Wegerer, and 
Steinmayr (2013) found significantly higher negative correlations 
between performance avoidance goals and standardized test scores 
as well as specific exam grades compared to GPA (see also Huang, 
2012). For mastery-approach and performance-approach goals, 
lower correlations emerged for standardized achievement test 
scores compared to other achievement indicators. Similar results 
were found in several meta-analyses on the academic self-concept 
(Hansford & Hattie, 1982; Huang, 2011). For our meta-analysis, 
we decided that it would be crucial to distinguish between stan- 
dardized achievement tests and teacher-assigned grades. With re- 
spect to the latter, it would be further important to differentiate 
between a specific grade on one exam and/or in one school subject 
from averaged measures such as GPA. As stated above, grades are 
biased to a certain degree by teachers’ perceptions. With respect to 
self-handicapping, we suggest that teachers do not appreciate 
finding handicapping behavior in their classes. Specifically, they 
might interpret such behavior as laziness, and this might lead them 
to lower the self-handicapper’s grade (Covington & Omelich, 
1979). Therefore, we predicted that the negative relation between 
self-handicapping and achievement would be higher when grades 
were used as achievement indicators compared to standardized test 
scores. 


Publication Status 


Significant findings are more likely to be accepted for publica- 
tion than nonsignificant ones (Ferguson & Brannick, 2012). This 
publication bias may be even larger for constructs for which rather 
high correlations with respective outcomes are theoretically pre- 
sumed. Given that self-handicapping is already defined as a self- 
impeding and performance-inhibiting strategy, it can be extremely 
difficult for researchers to publish data showing null relations 
between self-handicapping and achievement. We therefore sought 
to establish whether the respective correlations would be signifi- 
cantly different for published versus unpublished studies. 


Concurrent Versus Prospective Measurement 


Several authors have noted that the relation between self- 
handicapping and achievement is probably reciprocal (Covington, 
1992; McCrea et al., 2008; Zuckerman et al., 1998). That is, fear 
of failing on a test leads to self-handicapping, which, in turn, 
decreases one’s performance. This low performance even further 
enhances self-doubts about mastering the next test, and these 
self-doubts lead to the necessity to self-handicap again. Concep- 
tually, we are interested in the causal effects of self-handicapping 
on achievement and not vice versa. Due to the small number of 
longitudinal studies, however, we had to base the present meta- 
analysis on both longitudinal and cross-sectional studies. How- 
ever, if we were to find that the prospective effects of self- 
handicapping on achievement were substantially lower compared 
to concurrent correlational effects, this may provide some hints 
that at least some amount of the correlation can be attributed to the 
directional effect of achievement on self-handicapping. 


Reliability of the Self-Handicapping Scale 


Another methodological aspect frequently considered in meta- 
analyses is the quality of the individual studies. However, because 
of its subjectivity, considering a study’s quality is seen as quite 
controversial in the literature (e.g., Jüni, Witschi, Bloch, & Egger, 
1999). Hence, we focused on the reliability (Cronbach’s alpha) of 
the self-handicapping scale as a selected objective criterion for the 
methodological quality of a study. In this context, we assumed that 
the higher the reliability, the higher the correlation between self- 
handicapping and achievement would be. 


Specificity of the Self-Handicapping Measurement 


Several constructs related to academic self-handicapping have 
been found to be more predictive of achievement when they were 
measured in a domain-specific manner. Hansford and Hattie 
(1982), for instance, revealed mean correlations between domain- 
specific self-concepts and achievement that were almost twice as 
high as those between general measures of self-concept and 
achievement. Similar results were obtained by Huang (2011) and 
Valentine, DuBois, and Cooper (2004). With respect to achieve- 
ment goals, Huang (2012) reported that domain-specific measures 
of performance-avoidance goals were more strongly related to 
achievement than general ones. Based on these findings, we as- 
sumed that the relation between self-handicapping and achieve- 
ment would be higher when self-handicapping was measured 
domain-specifically. 


Achievement Domain 


As an extension of the issue of domain-specificity, we were also 
interested in whether self-handicapping would be more likely to 
produce deleterious effects on achievement in certain school do- 
mains. In particular, we sought to examine differences between 
mathematics and language subjects, as we presumed a distinct 
self-handicapping potential for math. In mathematics, students 
might experience tasks as solvable or not, which would serve as 
indirect feedback concerning intelligence for the student. There- 
fore, the degree to which a student would attribute failure to 
internally stable reasons (e.g., low intelligence) in mathematics 
may be higher compared to other subjects (e.g., English). Ability 
attributions of failure can lead to self-esteem threat, which, in turn, 
might enhance the probability that one will use self-handicapping 
strategies. Haag and Götz (2012) reported that students perceived 
the characteristics of mathematics completely differently than 
those of English. Math was especially characterized as having a 
high potential to induce self-threatening events and, thus, a high 
potential for self-handicapping. Most notably, students rated it as 
a subject in which one needs to be intelligent because diligence 
alone is not enough to obtain good grades. Unlike English, more- 
over, the right solutions for tasks in math were rated as clear and 
without ambiguity. Finally, math was characterized as more ef- 
fortful and more difficult than English. Because self-handicapping 
thus seems to be used more frequently in math, we suggest that the 
risk of becoming a chronic self-handicapper in math should also be 
considerably higher. Habitual self-handicapping in a domain, how- 
ever, increasingly undermines achievement, and this is why we 
predicted that there would be larger negative correlations between 
self-handicapping and achievement in math compared to language 
subjects. Supporting these ideas, Huang’s (2012) meta-analysis 
revealed that performance-avoidance goals displayed the highest 
negative effects on achievement in math, whereas Wirthwein et al. 
(2013) found that the achievement indicator was just a relevant 
moderator for performance-approach goals. 


Domain Matching 


The relevance of assessing all variables on the same level of 
specificity has already been discussed elsewhere (e.g., Baranik, 
Barron, & Finney, 2010). In line with these previous studies, we 
expected the correlations between self-handicapping and achieve- 
ment to be higher when the levels of specificity were matched, that 
is, when both variables were assessed on a global or domain- 
specific level (e.g., global self-handicapping and GPA or self- 
handicapping in math and achievement in math) compared to a 
“mismatch” (e.g., global self-handicapping and achievement in 
math). In the related field of achievement-goal research, Wirth- 
wein et al. (2013) and Huang (2012) found some evidence that this 
“domain matching” moderated the associations between mastery 
goals and achievement as well as between performance-avoidance 
goals and achievement. 


The Present Research 


There are several reasons why a meta-analysis on the relation 
between self-handicapping and achievement is needed. First, the 
heterogeneity in correlations between self-handicapping and 
achievement has not yet been addressed meta-analytically. Likewise, 
there has been no literature review that has 
addressed this topic in more detail. Existing reviews have rather 
been dedicated to determinants of self-handicapping such as gen- 
der (Hirt & McCrea, 2009), claimed versus behavioral handicap- 
ping (Leary & Shepperd, 1986), or achievement goals (Urdan & 
Midgley, 2001). As a consequence, little is known about (a) the 
mean effect size of the relation between self-handicapping and 
achievement and (b) which factors are the most relevant modera- 
tors of this effect. A further justification for the current meta- 
analysis is that it will help to rank self-handicapping as a deter- 
minant of school performance. In his synthesis of meta-analyses, 
Hattie (2009) ranked the most common predictors of school per- 
formance by their mean effect sizes, thereby providing guidance 
on the importance of each construct for educational practice. 
Certainly, more research efforts will be put toward psychological 
intervention programs for the variables with higher effect sizes. 
Unraveling the mean effect of self-handicapping on achievement 
will allow us to compare this effect with those of other important 
school performance predictors and to estimate the relative neces- 
sity of designing intervention programs against self-handicapping. 

In this article, we present the first meta-analysis on the relation 
between self-handicapping and achievement in the academic do- 
main. We sought to explore the mean effect size of the correlation 
between self-handicapping and achievement as well as to explain 
the heterogeneity in empirical findings by identifying significant 
moderators. Specifically, we examined the moderating impact of 
(a) different self-handicapping questionnaires; (b) school type; (c) 
different sample characteristics; (d) achievement goals; (e) self- 
esteem, self-efficacy, and academic self-concept; (f) level of 
achievement; (g) methodological aspects (publication type, time of 
measurement, reliability of the self-handicapping scale); (h) dif- 
ferent achievement indicators; (i) specificity of self-handicapping 
measurement; (j) different achievement domains; and (k) domain 
matching. 


Method 


Literature Search and Coding 


We systematically searched electronic databases (i.e., Psy- 
cINFO, ERIC, Google Scholar, Web of Knowledge, ProQuest 
Dissertations & Theses, OpenGrey, NTIS, PSYNDEX) for abstracts 
that contained one of the search terms self-handicapping, 
self-sabotage, self-deception, self-defeating behavior, safeguard- 
ing, self-deceiving, self-impairment, effort withdrawal,  self- 
impediment, or self-hindering and either the term achievement or 
the term performance. We included studies up to August 2013. 
This search led to a maximum of 366 abstracts to be checked. We 
also used cross-referencing to identify relevant studies. Studies 
were then included in the meta-analysis if (a) correlations between 
self-handicapping and academic achievement were specified (ex- 
cluding achievement in sports), (b) the research was conducted 
with self-report measures, (c) the sample comprised school or 
university students, and (d) the study was written in English or 
German. Moreover, we personally contacted known authors in the 
field and asked them for unpublished studies or unpublished data 
that matched the aforementioned criteria. By applying these criteria, 
we identified 36 studies and 49 effect sizes (see Table 1 for a 
list of the included studies with selected descriptive statistics). 

Table 2 shows the moderator variables that we considered, the 
coding of the categories, and the respective kappa coefficients. We 
computed Cohen’s kappa to calculate the coding reliability of the 
variables (Cohen, 1992). To this end, a second coder additionally 
categorized 10 of the included studies (i.e., about 34%). Kappas 
between .61 and .80 were classified as substantial and between .81 
and 1.00 as excellent (Landis & Koch, 1977). As can be seen, the 
coding reliabilities of each item were at least substantial with 
kappas ranging from .68 to 1.00. With two exceptions regarding 
the categories school type (κ = .68) and questionnaire (κ = .74), 
all kappas could be classified as excellent. 
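
As a minimal sketch of the interrater check described above, Cohen's kappa can be computed from two coders' category assignments. The function names and the ten example codes below are hypothetical and serve only as an illustration; they are not the coding data of the present study.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who categorized the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items placed in the same category.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement derived from each coder's marginal frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    if p_chance == 1.0:
        return 1.0  # both coders used one identical category throughout
    return (p_observed - p_chance) / (1 - p_chance)


def landis_koch_label(kappa):
    """The two highest Landis and Koch (1977) bands referred to in the text."""
    if kappa >= 0.81:
        return "excellent"
    if kappa >= 0.61:
        return "substantial"
    return "below substantial"


# Hypothetical school-type codes (1-4) for 10 double-coded studies.
first_coder = [1, 2, 2, 3, 3, 4, 4, 4, 2, 3]
second_coder = [1, 2, 3, 3, 3, 4, 4, 4, 2, 2]
kappa = cohens_kappa(first_coder, second_coder)
print(round(kappa, 2), landis_koch_label(kappa))
```

Kappa corrects the raw agreement rate for the agreement that two coders would reach by chance alone, which is why it is usually preferred to simple percentage agreement for categorical coding schemes.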


Effect Size Calculation and Analyses of Effect Sizes 


We considered the Pearson product moment correlation coeffi- 
cient as the effect size statistic for our meta-analysis. Outliers were 
defined as correlations that were more than 2 SDs above or below 
the mean of the correlation (this was true for just two correlations). 
According to Lipsey and Wilson (2001), we winsorized these two 
outliers to a less extreme value (2 SDs). If multiple effect sizes 
were reported in one study (e.g., different academic achievement 
indicators), we combined them into a single effect size using 
Fisher’s Z scores to avoid dependency in the data (Lipsey & 
Wilson, 2001). If results in one study were reported for different 
groups (e.g., gender, age), they were handled as two distinct and 
independent results. 
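
The two preprocessing steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually run: it assumes an unweighted mean and the sample standard deviation for the 2 SD boundary, and an unweighted average on the Fisher Z scale for combining correlations within a sample. The function names and example values are hypothetical.

```python
import math
from statistics import mean, stdev


def winsorize_2sd(correlations):
    """Pull correlations beyond mean +/- 2 SD back to that boundary."""
    m, sd = mean(correlations), stdev(correlations)
    low, high = m - 2 * sd, m + 2 * sd
    return [min(max(r, low), high) for r in correlations]


def combine_within_study(correlations):
    """Average several correlations from one sample on the Fisher Z scale."""
    z_mean = mean(math.atanh(r) for r in correlations)
    return math.tanh(z_mean)  # back-transform to the r metric


# Hypothetical example: nine typical correlations and one extreme value.
study_correlations = [-0.2] * 9 + [0.9]
print(winsorize_2sd(study_correlations))

# Hypothetical example: one sample reporting both a GPA and an exam correlation.
print(combine_within_study([-0.20, -0.40]))
```

Averaging on the Fisher Z scale rather than on the raw r scale avoids the slight bias that arises because the sampling distribution of r is skewed for nonzero correlations.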

With reference to the total effect, we chose a priori to integrate 
effect sizes by using the random effects model (REM; Hedges & 
Vevea, 1998), as its theoretical postulate allows the individual true 
effects to differ. We used a restricted maximum likelihood 
(REML) estimator of the variance of the true effects. As demon- 
strated by Monte Carlo simulations (Viechtbauer, 2005), this es- 
timator is efficient and has few biases. Each correlation coefficient 
was transformed into a Fisher’s Z score, and each effect size was 
weighted according to the REM by the inverse of the sum of the 
sampling variance and the estimated variance between the true effects 
(Lipsey & Wilson, 2001). To compute an estimate for the mean of 
the true effects of the individual studies, each effect size was 
multiplied by its weight, and the sum of these products was divided 
by the sum of the weights. To gain further insights into the 
heterogeneity of the effects, we used Cochran’s Q-Test for homogeneity 
(see Hedges & Olkin, 1985) and the I² statistic (Higgins & 
Thompson, 2002). Finally, all weighted mean effect sizes and 
corresponding confidence intervals were converted back to Pear- 
son product moment correlation coefficients. 
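
The pooling steps described in this paragraph can be sketched as follows. One deliberate simplification should be noted: the sketch estimates the between-study variance with the closed-form DerSimonian-Laird estimator instead of the REML estimator named above, because REML requires iterative fitting. The function name, the dictionary keys, and the example data are hypothetical.

```python
import math


def random_effects_meta(correlations, sample_sizes):
    """Pool correlations under a random-effects model (DerSimonian-Laird tau^2)."""
    z = [math.atanh(r) for r in correlations]      # Fisher Z transformation
    v = [1.0 / (n - 3) for n in sample_sizes]      # sampling variance of each Z
    k = len(z)

    # Fixed-effect weights are needed for Cochran's Q and for tau^2.
    w = [1.0 / vi for vi in v]
    fixed_mean = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - fixed_mean) ** 2 for wi, zi in zip(w, z))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0   # in percent

    # Random-effects weights: inverse of (sampling variance + tau^2).
    w_re = [1.0 / (vi + tau2) for vi in v]
    z_mean = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    ci = (math.tanh(z_mean - 1.96 * se), math.tanh(z_mean + 1.96 * se))
    return {"r": math.tanh(z_mean), "ci95": ci, "Q": q, "df": df,
            "tau2": tau2, "I2": i2}


# Hypothetical correlations and sample sizes, not data from Table 1.
result = random_effects_meta([-0.10, -0.25, -0.40, -0.22], [120, 300, 85, 640])
print(result)
```

Because the weights include the between-study variance, large samples dominate the pooled estimate less strongly than they would under a fixed-effect model, which is consistent with the assumption that the true effects differ across studies.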

Additionally, we used procedures to assess whether the results 
could have been affected by publication bias (Rothstein, Sutton, & 
Borenstein, 2005), which refers to an overestimation of the aver- 
age true effect due to the circumstance that published studies have 
larger effects than unpublished documents. First, we inspected 
funnel plots (Light, Singer, & Willett, 1994), which plot the 
individual effect sizes against their corresponding standard errors. 
An asymmetric distribution of the effect sizes around the estimated 
mean of the true effect can signal that the sample of the included 
studies is potentially biased. Second, the funnel plots were 
statistically tested for asymmetry with a rank correlation test 
(Begg & Mazumdar, 1994) and a regression test (Egger, Davey Smith, 
Schneider, & Minder, 1997). 


Table 1 
Included Studies With Selected Descriptive Statistics 


Study 


Boon (2007) 
Clarke et al. (2013) 
Cocorada (2011) 
De Castella et al. (2013) 
Elliot & Church (2003) 
Feick & Rhodewalt (1997) 
Gadbois (2013) 
Gadbois (2013) 
Gadbois & Sturgeon (2011) 
Kleitman & Gibson (2011) 
Leondari & Gonida (2007), Study 1 
Leondari & Gonida (2007), Study 2 
Leondari & Gonida (2007), Study 3 
Martin (2003), Study 1 
Martin (2003), Study 2 
Martin & Hau (2010), Study 1 
Martin & Hau (2010), Study 2 
Martin et al. (2001a) 
Martin et al. (2001b) 
Martin et al. (2003) 
Martin et al. (2013) 
McCrae (2013a) 
McCrae (2013a) 
McCrae (2013b) 
McCrea & Hirt (2001) 
McCrea et al. (2008), Study 1 (Male) 
McCrea et al. (2008), Study 1 (Female) 
McCrea et al. (2008), Study 2 (Male) 
McCrea et al. (2008), Study 2 (Female) 
Midgley et al. (1996) 
Midgley & Urdan (1995) 
Midgley & Urdan (2001) 
Murray & Warden (1992) 
Plenty & Heubeck (2011) 
Rhodewalt & Hill (1995) 
Schwinger (2013), Study 1 
Schwinger (2013), Study 2 
Schwinger & Kreppold (2012) 
Schwinger & Stiensmeier-Pelster (2010) 
Schwinger & Stiensmeier-Pelster (2012), Study 1 
Schwinger & Stiensmeier-Pelster (2012), Study 2 
Shih (2005) 
Thomas & Gadbois (2007) 
Turner et al. (2002) 
Urdan (2004) 
Urdan et al. (1998) 
Wesley (1994), Study 1 
Wesley (1994), Study 2 
Zuckerman et al. (1998) 


Note. School type: 1 = elementary school, 2 = middle school, 3 = high school, 4 = university; Gender: 
1 = ≤59% female, 2 = >59% female; Questionnaire: 1 = Academic Self-Handicapping Scale (Midgley & 
Urdan, 1995), 2 = Self-Handicapping Scale (Jones & Rhodewalt, 1982), 3 = Short Self-Handicapping Scale 
(Strube, 1986), 4 = Mixed (Midgley & Urdan, 1995; Strube, 1986), 5 = Motivation and Engagement Scale 
Self-Sabotage subscale (Liem & Martin, 2012), 6 = Others; AI = academic achievement indicator: 1 = grade 
point average, 2 = achievement test, 3 = exam grade, 4 = school report-card grade. 

Table 2 
Coding Scheme and Interrater Reliability 


Variable: Coding (κ) 


Questionnaire: 1 = Academic SHS (Midgley & Urdan, 1995); 2 = SHS (Jones & Rhodewalt, 1982); 3 = Mixture (Midgley & Urdan, 1995; Strube, 1986); 4 = Others (κ = .74) 

School type: 1 = elementary school; 2 = middle school; 3 = high school; 4 = university; 5 = mixed (κ = .68) 

Gender: 1 = ≤59% female; 2 = >59% female (κ = 1.00) 

Achievement goals; self-esteem; self-efficacy; academic self-concept; achievement level in the sample: 1 = low; 2 = medium sized/average; 3 = high (κ = 1.00) 

Origin of the sample: 1 = United States; 2 = Europe; 3 = Asia; 4 = Australia; 5 = South America; 6 = Canada (κ = .90) 

Achievement indicator: 1 = GPA; 2 = achievement test score (SAT, etc.); 3 = exam grade; 4 = semester/school report-card grade; 5 = mixed (κ = [illegible]) 

Time: 1 = concurrent assessment; 2 = prospective assessment (κ = .87) 

Specificity self-handicapping: 1 = general/school/university; 2 = math and sciences; 3 = others (κ = 1.00) 

Achievement domain: 1 = general/school/university; 2 = math and sciences; 3 = languages and other subjects; 4 = mixed (κ = [illegible]) 

Domain matching: 1 = mismatch; 2 = match (κ = 1.00) 


Note. SHS = Self-Handicapping Scale; GPA = grade point average; SAT = Scholastic Aptitude Test. 


To examine potential moderators, we referred to the meta-analytic 
mixed effects model (MEM; Hedges & Olkin, 1985; Raudenbush, 2009), 
which transfers the REM to the fixed values of potential moderators. 
We analyzed the variability in the effect sizes due to differences 
between the categories of the respective potential moderator (e.g., 
different achievement indicators) with a weighted meta-analytic 
analogue to the analysis of variance. A statistically significant 
Q_B score implies that the mean effect sizes of the groups or 
categories of the respective moderator differ by more than sampling 
error (Lipsey & Wilson, 2001). We referred to the principle that 
nonoverlapping 95% confidence intervals (CIs) indicate a meaningful 
difference between two effect sizes (see Lipsey & Wilson, 2001) to 
obtain information on the discrepancy between two specific 
categories. For continuous variables such as ethnicity or the 
reliability of the self-handicapping scales, weighted least squares 
(WLS) meta-regression analyses (see Steel & Kammeyer-Mueller, 2002; 
Viechtbauer, 2008) were used. One meta-analysis of variance (ANOVA) 
or meta-regression analysis was conducted for each moderator. In 
addition, a multiple WLS meta-regression was conducted to identify 
the relevance of one moderator compared to others. 
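
The logic of the meta-analytic analogue to the analysis of variance can be sketched as follows. The sketch partitions the total Q statistic into a within-category and a between-category part and returns the latter. As a simplification, it uses fixed-effect inverse-variance weights, whereas the mixed effects model described above would add the estimated between-study variance to each sampling variance. The function names, category labels, and values are hypothetical.

```python
def weighted_mean(effects, weights):
    """Inverse-variance weighted mean of a set of effect sizes."""
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def q_statistic(effects, weights, center):
    """Weighted sum of squared deviations from a given center."""
    return sum(w * (e - center) ** 2 for w, e in zip(weights, effects))


def q_between(groups):
    """Between-category Q for a categorical moderator.

    `groups` maps a category label to a list of (effect, variance) pairs,
    with effects on the Fisher Z scale. Returns (Q_between, df).
    """
    all_effects, all_weights = [], []
    q_within = 0.0
    for pairs in groups.values():
        effects = [e for e, _ in pairs]
        weights = [1.0 / v for _, v in pairs]
        q_within += q_statistic(effects, weights, weighted_mean(effects, weights))
        all_effects += effects
        all_weights += weights
    grand_mean = weighted_mean(all_effects, all_weights)
    q_total = q_statistic(all_effects, all_weights, grand_mean)
    return q_total - q_within, len(groups) - 1


# Hypothetical effects grouped by achievement indicator.
example = {
    "grades": [(-0.28, 0.004), (-0.31, 0.010), (-0.24, 0.006)],
    "test scores": [(-0.12, 0.005), (-0.18, 0.012)],
}
q_b, df_b = q_between(example)
print(q_b, df_b)
```

The resulting value is compared with a chi-square distribution whose degrees of freedom equal the number of categories minus one; a value above the critical threshold indicates that the category means differ by more than sampling error.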

To obtain all relevant information for the moderator analyses 
regarding the categories “achievement indicator,” “specificity of 
the self-handicapping-scale,” “achievement domain,” and “domain 
matching,” all correlation coefficients in each study were used 
(and not the averaged correlations). The above-mentioned analyses 
were conducted with the SPSS macros developed by Lipsey and 
Wilson (2001) and with the metafor package (Viechtbauer, 2010) 
for R (R Development Core Team, 2010). 


Results 


Mean Effect Size 


The 36 included studies comprised k = 49 independent 
samples with a total of N = 25,550 participants (range: N = 43 
to N = 6,366). Publications came from the United States 
(38.8%), Europe (30.6%), or Australia (20.4%). The studies 
were published between the years 1992 and 2013. University 
students comprised 51% of the participants, and 49% were 
school students. 

The random-effects model revealed a mean correlation between 
self-handicapping and achievement of r = −.23 (p < .001; range:
r = −.46 to r = .02; 95% CI [−.25, −.20]; k = 49). A forest plot
of the included studies can be found in Figure 1. According to the 
standard already provided, the correlation is of a medium size 
(Hattie, 2009). Next, we tested whether this mean effect size could 


be influenced by publication bias. An inspection of the funnel plot 
in Figure 2 showed that the individual effects were not asymmet- 
rically distributed around this estimate of the mean of the true 
effects. This impression was confirmed by statistical tests of funnel 
plot asymmetry. Neither the rank correlation test (Kendall’s τ =
0.01, p = .90) nor the regression test (z = −0.36, p = .72)
indicated a funnel plot asymmetry, so there were no indications 
that the findings were biased. 
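
The pooling step behind a mean correlation of this kind can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the DerSimonian-Laird estimator of the between-study variance, whereas the text reports only that a random-effects model was fitted; the function name `pool_correlations` and the example correlations and sample sizes are hypothetical. Correlations are transformed to Fisher's z before pooling and transformed back afterward.

```python
import math


def pool_correlations(rs, ns):
    """Pool correlations with a DerSimonian-Laird random-effects model.

    Needs at least two studies, each with n > 3.
    """
    zs = [math.atanh(r) for r in rs]          # Fisher's z transformation
    variances = [1.0 / (n - 3) for n in ns]   # sampling variance of z
    weights = [1.0 / v for v in variances]    # fixed-effect weights

    sum_w = sum(weights)
    z_fixed = sum(w * z for w, z in zip(weights, zs)) / sum_w
    q = sum(w * (z - z_fixed) ** 2 for w, z in zip(weights, zs))
    df = len(zs) - 1

    # Method-of-moments estimate of the between-study variance
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights add tau2 to each sampling variance
    re_weights = [1.0 / (v + tau2) for v in variances]
    z_re = sum(w * z for w, z in zip(re_weights, zs)) / sum(re_weights)
    se = math.sqrt(1.0 / sum(re_weights))

    # Back-transform the estimate and its 95% CI to the r metric
    ci = (math.tanh(z_re - 1.96 * se), math.tanh(z_re + 1.96 * se))
    return {"r": math.tanh(z_re), "ci": ci, "q": q, "tau2": tau2}


# Hypothetical correlations and sample sizes, for illustration only
example = pool_correlations([-0.10, -0.25, -0.35, -0.20], [120, 300, 80, 450])
```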

Cochran’s Q test suggested heterogeneity (Q = 156.50, df =
48, p < .001), which means that the individual observed effects 
differed more than would be expected if sampling error were the 
only source of variability. Hence, there were differences among 
the true effects. The amount of total variability between the ob- 
served effect sizes that was due to heterogeneity was estimated to 
be I² = 70.32%, 95% CI [55.72%, 82.96%], and could be classified
as “high” (Higgins & Thompson, 2002). Subsequently, we
tested whether the heterogeneity could be (at least) partially ex- 
plained by the variables that we considered as potential modera- 
tors. 
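
The link between Q and I² can be worked through with the reported values. The sketch below assumes the common Q-based definition, I² = (Q − df) / Q; the function name `i_squared` is hypothetical. With Q = 156.50 and df = 48 it gives roughly 69.3%, close to but not identical with the reported 70.32%, which may have been derived from the estimated between-study variance instead.

```python
def i_squared(q, df):
    """Share of total variability attributed to heterogeneity (Q-based)."""
    if q <= 0:
        return 0.0
    # Truncate at zero when Q is smaller than its degrees of freedom
    return max(0.0, (q - df) / q)


# Reported test statistic and degrees of freedom (k - 1 = 48)
i2 = i_squared(156.50, 48)
i2_percent = 100.0 * i2
```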


Moderator Analyses 


Table 3 provides an overview of the moderator analyses.* 
Four significant moderator variables were identified. The largest
moderator effect was found for “school type” (Q_B = 18.83,
p < .01). The confidence intervals of the categories “elementary
school” (r = −.29; 95% CI: −.35 ≤ r ≤ −.23) and
“middle school” (r = −.34; 95% CI: −.41 ≤ r ≤ −.25)
compared to “high school” (r = −.23; 95% CI: −.26 ≤
r ≤ −.18) and “university students” (r = −.18; 95% CI:
−.21 ≤ r ≤ −.14) did not overlap, indicating a statistically
significant difference between these correlations. Furthermore,


* We also examined possible moderating effects of language (German, 
English), research group (groups of authors/researchers), and year of 
publication. However, because we found no significant effects, we decided 
not to report these moderators in the article. 

[Figure 1 appears here: a forest plot listing each sample by author(s) and year with its observed effect and 95% CI. The summary row for the random-effects (RE) model reads −0.23 [−0.25, −0.20]; the x-axis, Observed Effect, runs from −0.80 to 0.40.]

Figure 1. Forest plot of self-handicapping and achievement (correlations transformed into Fisher’s Z scores). 


CI = confidence interval; RE = random effects. 


the variable “questionnaire” emerged as a significant moderator
of the mean correlation between self-handicapping and achievement
(Q_B = 24.74, p < .01). Lower correlations were found for
the SHS (r = [...]; 95% CI: [...] ≤ r ≤ −.06) compared to
the categories “Academic SHS” (r = −.25; 95% CI: −.29 ≤
r ≤ [...]) and MES (r = −.29; 95% CI: [...] ≤ r ≤ −.21).
Mastery-approach goals were an additional statistically significant
moderator (Q_B = 8.93, p < .01): We found higher effect
sizes in samples with medium sized mastery-approach goals
(r = −.38; 95% CI: −.45 ≤ r ≤ −.30) compared to samples
with high mastery-approach goals (r = −.25; 95% CI: −.28 ≤
r ≤ −.20). The reliability of the self-handicapping scales was an
additional significant moderator (unstandardized b = −0.67,
[...]).

The effect sizes for published studies were comparable to un- 
published studies, as can be seen in Table 3. Origin of the sample 


was not a significant moderator, as was indicated by similar
correlations between self-handicapping and achievement in the
U.S., European, and Australian samples. The regression
analysis with the continuous variable ethnicity (percentage White
within a sample, available for only eight studies) was not significant
either (unstandardized b = −0.00, z = −0.35, p = .73). With
respect to gender, there were slightly higher effect sizes in samples
with larger proportions of males. However, the differences were
not significant. Similar findings were obtained for the moderator
“time.” Although the mean correlation appeared to be higher for
concurrent assessments of self-handicapping and achievement, the
Q_B index did not reach significance. The mean effect sizes were
also similar across several indicators of academic achievement, 
notwithstanding the smaller correlation between self-handicapping 
and test scores. Moreover, the specificity of the self-handicapping 
measurement, the achievement domain considered, and domain 

[Figure 2 appears here: a funnel plot with Observed Effect on the x-axis (−0.60 to 0.00) and Standard Error on the y-axis (0.000 to 0.158).]


Figure 2. Funnel plot of self-handicapping and achievement. 


matching did not serve as significant moderators. Performance- 
approach or -avoidance goals and the level of self-esteem, self- 
efficacy, or academic self-concept did not serve as moderator 
variables either. In this context, it has to be noted that the
categories for these variables comprised only a few studies (see
Table 3).


Multiple WLS Meta-Regression Analysis 


Potential meta-analytic moderators can sometimes be con- 
founded, and this can result in spurious effects (Lipsey, 2003). 
With respect to the present results, we were interested in the 
stand-alone influences of the identified moderators. Therefore, we 
examined the relative effects of the three moderators school type, 
questionnaire, and reliability of the self-handicapping scales in a 
multiple WLS meta-regression analysis (see Steel & Kammeyer- 
Mueller, 2002; Viechtbauer, 2008). Due to the small number of 
existing studies, we decided to eliminate the moderator “mastery 
goals” from the multiple regression. Based on the results of the 
moderator analyses, we created two dummy variables: school type 
(1 = elementary and middle school students; 0 = high school and 
university students) and questionnaire (1 = others; 0 = SHS). We 
chose these categories for the regression analysis because the 
confidence intervals between them did not overlap. The total 
regression model was significant (Q_model = 39.50, df = 3, p < .01;
k = 36) and accounted for 67.39% of the variance between the true
effect sizes (i.e., of the heterogeneity). In the combined analysis,
the variables school type (unstandardized b = −0.09, z = −3.16,
p < .01) and questionnaire (unstandardized b = −0.17,
z = −3.53, p < .01) retained their moderating effects.*

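
How an unstandardized coefficient for a dummy-coded moderator can be read is illustrated by the sketch below. It covers only the simplest case, a weighted regression with a single dummy predictor, in which the slope equals the difference between the weighted mean effect sizes of the two categories. The multiple meta-regression reported here additionally adjusts for the other predictors, which this sketch does not do. All effect sizes and weights are hypothetical, as are the function names.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def wls_simple(x, y, w):
    """Weighted least squares for one predictor; returns (intercept, slope)."""
    x_bar = weighted_mean(x, w)
    y_bar = weighted_mean(y, w)
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for xi, yi, wi in zip(x, y, w))
    sxx = sum(wi * (xi - x_bar) ** 2 for xi, wi in zip(x, w))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical effect sizes, dummy codes, and inverse-variance weights
effects = [-0.31, -0.35, -0.28, -0.20, -0.17, -0.22]
dummy = [1, 1, 1, 0, 0, 0]   # e.g., 1 = younger students, 0 = older students
weights = [97, 147, 57, 197, 297, 77]

intercept, slope = wls_simple(dummy, effects, weights)

# Weighted mean effect size within each category
mean_coded_1 = weighted_mean(
    [e for e, d in zip(effects, dummy) if d == 1],
    [w for w, d in zip(weights, dummy) if d == 1],
)
mean_coded_0 = weighted_mean(
    [e for e, d in zip(effects, dummy) if d == 0],
    [w for w, d in zip(weights, dummy) if d == 0],
)
difference = mean_coded_1 - mean_coded_0
```

In this single-predictor case the intercept is the weighted mean of the category coded 0, and the slope is the category difference.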
Discussion 


The present meta-analysis had two purposes. First, we aimed 
to unravel the mean correlation between self-handicapping and 
achievement in the academic domain. Such an analysis allowed 


us to determine the relative importance of self-handicapping 
compared to other predictors of academic performance (Hattie, 
2009). Second, we sought to identify relevant moderators of the 
relation between self-handicapping and achievement. Based on 
conceptual investigations of the two main instruments for as- 
sessing self-handicapping, we were mainly interested in 
whether using the ASHS (Urdan et al., 1998) versus the SHS 
(Jones & Rhodewalt, 1982) would yield a substantive modera- 
tor effect. 


Mean Effect Size 


We identified 36 studies with 49 independent effect sizes and 
25,550 participants. The mean correlation between self-handicapping
and achievement was moderately negative
(r = −.23), indicating that the frequent use of self-handicapping is
probably associated with poor performance. In his synthesis of 
over 800 meta-analyses, Hattie (2009) reported mean effect sizes 
for a wide range of individual and contextual predictors of aca- 
demic performance. In Hattie’s metric, a correlation of r = .20 
(which equals d = 0.40) reflects a moderate effect size. Given that 
similar effect sizes were found for prominent predictors of school 
performance, such as the ability self-concept (d = 0.43) or intrin- 
sic motivation (d = 0.48), it can be concluded that self- 
handicapping represents a meaningful correlate of academic 
achievement. Even more important, several predictors that had 
received great attention in educational psychology research were 
found to have surprisingly small effects on academic outcomes. 
For instance, Hattie (2009) reported effects of d = 0.12 for gender, 
d = 0.29 for homework, d = −0.18 for television watching, and
d = 0.18 for web-based learning. These findings stress the relative 
importance of self-handicapping for academic achievement even 
further. 
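
The correspondence between r and d used in this comparison can be checked with the standard conversion d = 2r / sqrt(1 − r²). The sketch below assumes that this formula (which presumes two groups of equal size) is the conversion intended; the function name `r_to_d` is hypothetical. It gives d of about 0.41 for r = .20 and about −0.47 for the mean correlation of r = −.23.

```python
import math


def r_to_d(r):
    """Convert a correlation to Cohen's d, assuming equal group sizes."""
    return 2.0 * r / math.sqrt(1.0 - r * r)


d_benchmark = r_to_d(0.20)    # benchmark for a moderate effect
d_observed = r_to_d(-0.23)    # mean correlation from this meta-analysis
```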


Moderator Analyses 


The second aim of the current meta-analysis was to analyze the 
relevance of potential moderator variables (e.g., self-handicapping 
questionnaire, school type, gender, achievement indicator, concur- 
rent vs. prospective measurement, specificity of the self- 
handicapping measurement, achievement domain, domain match- 
ing). The need for moderator analyses was stressed by the 
heterogeneity between individual effect sizes. Four moderators 
were significant in the univariate analyses, namely, school type, 
the respective self-handicapping questionnaire, the level of mas- 
tery goals in the sample, and the reliability of the self- 
handicapping scale. In the multiple meta-regression analysis, 
school type and questionnaire remained statistically significant and 
were able to explain most of the existing variability between the 
effect sizes. 

The finding regarding school type is in line with our assumption 
that younger students show higher relations between self- 


* We report unstandardized regression coefficients, as they are more 
easily interpreted than standardized regression coefficients in the case of 
dummy-coded predictor variables. The unstandardized coefficients display 
the difference between the mean effect sizes of the two categories (e.g., 
1 = elementary and middle school students; 0 = high school and university
students) of the respective moderator variable (e.g., school type) when the 
other moderator (e.g., questionnaire) is statistically controlled for. 
Table 3
Results of the Moderator Analyses

Moderator                                     k     ES       SE     95% CI        z        Q_B
Questionnaire                                                                              24.74*
  Academic SHS (Midgley & Urdan, 1995)        28    −.25**   .02    [...]         [...]
  SHS (Jones & Rhodewalt, 1982)               11    [...]    .03    [...]         −4.27
  Short SHS (Strube, 1986)                    2     [...]    .08    [...]         [...]
  Mixed (Midgley & Urdan, 1995; Strube, 1986) 4     [...]    .04    [...]         [...]
  MES (Liem & Martin, 2012)                   6     [...]    .03    [...]         −8.49
  Others                                      8     [...]    .05    [...]         −4.18
School type                                                                                18.83**
  Elementary                                  6     [...]    .03    [...]         −9.20
  Middle school                               4     [...]    .04    [...]         [...]
  High school                                 12    [...]    .02    [...]         −10.14
  University                                  25    −.18**   .02    −.21, −.14    −9.67
Gender                                                                                     2.84
  ≤59% female                                 26    [...]    .02    [...]         [...]
  >59% female                                 17    [...]    .02    [...]         [...]
Mastery-approach goals                                                                     8.93**
  Medium sized/average                        3     [...]    .05    −.45, −.30    −8.78
  High                                        10    [...]    .02    [...]         [...]
Performance-approach goals                                                                 0.29
  Medium sized/average                        [...] [...]    .06    [...]         −4.73
  High                                        1     [...]    [...]  [...]         −2.08
Performance-avoidance goals                                                                0.48
  Medium sized/average                        5     [...]    .07    [...]         [...]
  High                                        1     −.36*    .18    [...]         [...]
Self-efficacy                                                                              0.74
  Medium sized/average                        1     [...]    [...]  [...]         [...]
  High                                        8     [...]    .03    [...]         −7.93
Self-esteem                                                                                0.22
  Medium sized/average                        2     [...]    .09    −.44, −.13    −3.50
  High                                        [...] [...]    .05    [...]         [...]
Academic self-concept                                                                      2.89
  Medium sized/average                        2     [...]    .04    [...]         [...]
  High                                        4     [...]    .02    [...]         −14.81
Achievement                                                                                [...]
  Low                                         2     [...]    .08    [...]         −2.91
  Middle                                      33    [...]    .02    [...]         −10.81
  High                                        10    [...]    .04    [...]         [...]
Origin of the sample                                                                       3.32
  United States                               24    [...]    .02    [...]         −9.15
  Europe                                      11    −.24     .03    [...]         [...]
  Australia                                   12    −.25     .03    [...]         [...]
  Asia                                        [...] [...]    .07    [...]         −4.00
Achievement                                                                                2.66
  GPA                                         24    [...]    .02    [...]         −10.04
  Achievement test score                      18    [...]    .02    −.24, −.14    [...]
  Exam grade                                  17    [...]    .03    [...]         −7.96
  Semester/school report-card grade           21    −.24**   .02    [...]         −10.49
Publication type                                                                           0.03
  Published                                   42    [...]    .02    [...]         −14.72
  Unpublished                                 7     [...]    .04    [...]         [...]
Time                                                                                       [...]
  Concurrent                                  35    [...]    .02    [...]         −14.34
  Prospective                                 12    [...]    .03    [...]         [...]
Specificity of self-handicapping measurement                                               3.76
  General/school                              38    [...]    .02    [...]         −14.18
  Math and science                            8     [...]    .03    [...]         −7.46
  Others                                      2     [...]    .06    [...]         [...]
Achievement domain                                                                         3.90
  General/school                              36    [...]    .02    [...]         −10.96
  Math and science                            20    [...]    .02    [...]         −10.56
  Other subjects                              14    [...]    .03    [...]         −6.92
  Languages                                   [...] −.25**   .04    [...]         [...]
  Mixed                                       3     [...]    .06    [...]         [...]
Domain matching                                                                            1.75
  Match                                       49    [...]    .02    −.24, −.18    −11.94
  Mismatch                                    [...] [...]    .02    −.28, −.21    −13.74

Note. k = number of effect sizes; ES = mean effect size; SE = standard error of ES; 95% CI = lower and upper limits of 95% confidence interval;
z = z test for significance of r; Q_B = homogeneity estimate; SHS = Self-Handicapping Scale; MES = Motivation and Engagement Scale; GPA = grade point
average. Entries shown as [...] could not be recovered from the source.
*p < .05. **p < .01.


handicapping and performance. This might be due to differences in 
grading structures (e.g., teachers of younger students might weight 
self-handicapping more negatively when assigning grades) and/or 
age-related developments (e.g., due to a poorly differentiated abil- 
ity self-concept, specific failures may be interpreted to mean that 
the student is less capable in school in general). Moreover, the 
moderator questionnaire remained statistically significant in the 
multiple meta-regression as well. As argued in the theory section, 
the instruments used to assess self-handicapping differ in several 
ways. In fact, the SHS items are only partially in line with Urdan 
and Midgley’s (2001) required features of a valid self- 
handicapping item, and the criteria that the items meet are not 
consistent across all SHS items. Moreover, the SHS assesses rather 
undifferentiated avoidance behavior, and agreement with items on 
the SHS can be justified by several reasons other than self- 
handicapping. However, self-handicapping becomes maladaptive 
for academic performance when the various aspects of self- 
handicapping all come together. Just showing a potential handi- 
capping behavior, such as procrastination, is only one part of the 
self-handicapping construct. The more important parts include the 
a priori timing of the strategy and the reason for the behavior (e.g., 
procrastinating in order to have a handicap in case of failure). 
Because these aspects are more strongly represented by the ASHS 
and the respective MES subscale, it seems reasonable that the 
correlation between self-handicapping and achievement would be 
considerably higher when using these instruments. However, we 
cannot rule out alternative explanations of the moderating effects 
of the different questionnaires. With regard to the underlying 
specificity of self-handicapping measures, for instance, the ASHS 
and the MES measure self-handicapping more directly in terms of 
concrete behaviors, whereas the SHS rather assesses individual 
differences in the tendency to engage in self-handicapping behav- 
iors. That is, the SHS operationalizes self-handicapping as a more 
distal construct like a broad personality trait. As a consequence, 
one might attribute the different effect sizes for the SHS versus 
other questionnaires to some kind of a bandwidth-fidelity problem 
(Baranik et al., 2010), resulting in the broader handicapping mea- 
sure being less predictive of important outcomes. Future studies 
should examine this alternative interpretation in more detail. 

Our findings have several important implications for both self- 
handicapping research and educational practice. First, one could 
argue that studies using the SHS are not informative when esti- 
mating the correlation between self-handicapping and achieve- 
ment. It is thus crucial that self-handicapping researchers discuss 
the construct validity of the available questionnaires in order to 
gain a precise understanding of the maladaptive effects of self- 
handicapping in the academic domain. Second, age-related differ- 
ences in the effects of self-handicapping on achievement should be 


considered: The correlations were more strongly negative when the
students were in elementary or middle school compared to high 
school or university. Third, we provided several reasons for the 
poor construct validity of the SHS. Researchers are thus cautioned 
to check the face validity of self-handicapping items before using 
them in their studies. 

Participants’ gender was not found to be a significant moderator 
in our meta-analysis. Although it might be plausible that women 
are somewhat “smarter” about choosing the kind and the degree of 
the handicap (e.g., women may be more sensitive about not re- 
ducing their effort more than necessary), this assumption was not 
supported by the present data. More sophisticated investigations 
may clarify whether the consequences of self-handicapping are 
really the same for women and men. 

A recent meta-analysis on the relation between achievement
goals and academic achievement identified the respective achieve- 
ment indicator as a relevant moderator variable (Wirthwein et al., 
2013). Surprisingly, our moderator analyses revealed no signifi- 
cant effects for different indicators of academic achievement. 
Although the correlation with test scores was slightly smaller 
compared to the other indicators, the differences were not signif- 
icant. We additionally could not find a moderating effect for 
“achievement domain.” That is, the mean effects were similar 
across several domains such as mathematics and languages. Given 
the considerable effect size when using GPA as the criterion, our 
findings are in line with numerous studies that have emphasized 
the negative long-term effects of academic self-handicapping (e.g., 
Martin et al., 2001a; Midgley & Urdan, 2001; Zuckerman et al., 
1998). However, similar effect sizes were found for the more 
specific achievement criteria “exam grade” and “school report-card
grade”; thus, these findings similarly contradict the often-claimed
statement that singular self-handicapping events are less
costly to performance. It is thus possible that students enter the 
“vicious cycle” of low performance and self-handicapping after 
some singular handicapping situations. To verify this conclusion, 
future studies could examine the longitudinal trajectories of stu- 
dents’ handicapping-performance relations in more detail. 

It is also important to note that self-handicapping had similar 
effects when using test scores versus teacher-assigned grades as 
the achievement criterion. This finding underlines the maladaptive 
impact of self-handicapping on performance. To some degree, 
negative effects on grades might be attributed to biased grading 
practices by teachers. Randall and Engelhard (2010) provided 
teachers with scenarios that described student ability, achieve- 
ment, behavior, and effort, and asked them to assign both a 
numerical and letter grade. Results showed that teachers based 
their grades primarily on performance but to a smaller extent also 
on nonachievement indicators. Because teachers often interpret 
self-handicapping negatively, it is possible that self-handicappers
received extra deductions in marks, which would have resulted in 
an overestimation of the correlation between self-handicapping 
and achievement when using grades as the achievement criterion. 
However, because we failed to find a moderating effect of the 
achievement indicator, we conclude that the negative effects of 
self-handicapping are reflected not only in subjective measures of 
performance but also in objective ones. 

The mean correlation was similar when self-handicapping was 
assessed with global versus domain-specific measures. Moreover, 
in contrast to recent meta-analyses on achievement goals (Huang, 
2012; Wirthwein et al., 2013), there was no effect of domain- 
matching. These results suggest that self-handicapping represents 
a rather global construct that has similar effects across different 
contexts such as school domains. However, further research is 
needed to explore the extent to which self-handicapping in one 
domain (e.g., mathematics) generalizes to other school subjects 
(e.g., languages, natural sciences). 

One of the most important questions in psychological research 
refers to the causality of relations, that is, the chicken-and-egg
problem. In this meta-analysis, we were interested in the effects of
self-handicapping on achievement, thereby implying that the for- 
mer causally determines the latter. However, we also agree with 
several authors who have proposed a reciprocal effect between the 
two constructs (Zuckerman et al., 1998). That we did not find 
significant differences in concurrent versus prospective correla- 
tions may be interpreted to mean that the cross-sectional correla- 
tion coefficient might yet provide an acceptable estimation of the 
causal effect of self-handicapping on academic achievement. It has 
to be noted, however, that we did not control for previous achieve- 
ment or for previous self-handicapping. Such cross-lagged analy- 
ses could yield more satisfactory conclusions about the topic of 
causality in future research. 

A further as-yet-unmentioned reason for the substantial variabil- 
ity in correlation coefficients may be the distinction between 
behavioral and claimed self-handicapping (e.g., Arkin & Baum- 
gardner, 1985). It is obvious that just claiming to have a handicap 
(such as pretending to have test anxiety or physical symptoms) is 
not necessarily accompanied by poor performance, whereas the 
active acquisition of an impediment (e.g., alcohol abuse) should be 
more likely to decrease one’s performance. In this regard, it is 
interesting that the SHS includes a larger number of claimed 
handicapping items than the ASHS and the MES. However, the 
vast majority of questionnaire studies in the field have focused on 
the association between only a combined self-handicapping scale 
(i.e., one that does not differentiate between claimed and behav- 
ioral self-handicapping) and academic achievement. Hence, we 
were not able to take this distinction into account with regard to 
the mean effect size. Future research should explicitly separate 
these two self-handicapping strategies on questionnaires and ana- 
lyze the individual consequences for achievement or achievement- 
related behavior (McCrea et al., 2008). Moreover, it would be 
interesting to examine different forms of self-handicapping sepa- 
rately (such as procrastinating or claiming test anxiety). Due to the 
small number of studies that have investigated the association 
between self-handicapping and achievement and the lack of a 
differentiated self-handicapping scale, we were not able to analyze 
this aspect in the current meta-analysis. 


We found smaller associations between self-handicapping 
and achievement when the level of mastery-approach goals was 
high. This result should be interpreted rather cautiously because 
of the small number of studies considered and also because it 
might be attributed to the restricted variance in the group of the 
highly mastery-oriented students. This caveat notwithstanding, 
our findings suggest that a high level of mastery goals buffers 
the maladaptive effects of self-handicapping on achievement, 
an interpretation that appears to be reasonable from a concep- 
tual perspective. At a certain point during the self-handicapping 
process, students realize that this behavior impedes their
performance. Primarily performance-oriented students would then
attribute this growing failure to internal, stable, and uncontrol- 
lable factors (e.g., low intelligence), and this attribution would 
probably reinforce the presumed reciprocal cycle of self- 
handicapping and low performance and lead them to continue 
handicapping. However, the additional activation of a mastery 
goal orientation might lead students to see the failure from a 
different perspective and to attribute it to controllable factors 
(Schwinger & Stiensmeier-Pelster, 2011). If successful, this 
might help students to significantly reduce the amount and/or 
duration of self-handicapping. 

Taken together, our analyses revealed important moderators of 
the relationship between self-handicapping and achievement. 
However, the number of non-significant moderators in our study is 
also remarkable. The fact that self-handicapping cuts across so 
many very different groups and contexts is a major finding of this 
meta-analysis and it underlines the universality of self- 
handicapping effects. These results seem to indicate that self- 
handicapping is more trait-like than sometimes presumed or at 
least influenced by trait-like drivers such as fear of failure or 
self-esteem (Rhodewalt & Tragakis, 2002). 


Self-Handicapping Interventions 


Given the considerable correlation with achievement and the 
substantial heterogeneity in effect sizes, it seems necessary to 
develop adequate educational interventions against self-handicapping.
To date, specific training programs that focus explicitly on
reducing self-handicapping are barely available. Kearns, Forbes, 
and Gardiner (2007) conducted a cognitive behavioral coaching 
intervention (CBC) with doctoral students in order to reduce 
perfectionism and self-handicapping. In a 6-week workshop series, 
participants learned to alter inaccurate cognitive assumptions 
about themselves through the use of several CBC techniques such 
as cognitive restructuring, thought diaries, and the normalizing of 
one’s thoughts. Results revealed a significant decrease in self- 
handicapping at the follow-up assessment 4 weeks after the inter- 
vention. Martin (2005) implemented a series of workshops target- 
ing students’ motivation and engagement. Measurement involved 
the Motivation and Engagement Scale—High School (MES-HS; 
Martin, 2007) at the outset of the program, toward the end of the 
program, and again 6-8 weeks later. Data showed a significant
reduction in self-handicapping as well as gains on key facets of 
students’ motivation by the end of the program and also 6-8 
weeks later. 

Our results suggest that fostering mastery-approach goals in 
students might also help to reduce the amount of self-handicapping 
and its negative impact on academic achievement. Mastery-oriented
students believe that the self and performance are malleable
(Blackwell, Trzesniewski, & Dweck, 2007) and that self-worth
is not contingent on one’s abilities. Consequently, they do
not interpret failure as feedback concerning their self-esteem, but 
they view negative task experiences as opportunities for personal 
growth. Moreover, mastery-oriented learners tend to judge nega- 
tive feedback more positively. Because they are focused on indi- 
vidual reference norms, they are more likely to attribute failure to 
modifiable and controllable factors such as low effort (Ames, 
1992). In line with these considerations, Schwinger and 
Stiensmeier-Pelster (2011) reported that both performance- 
avoidance goals and low self-esteem had a lower impact on self- 
handicapping when mastery goals were also highly salient. Like- 
wise, the data presented here revealed that the (additional) pursuit 
of mastery goals reduces the maladaptive effects of self- 
handicapping on academic performance. A possible interpretation 
may be that mastery-oriented students are more flexible in man- 
aging the duration and intensity of self-handicapping, and such an 
ability may help them to avoid rather extreme declines in academic 
performance. However, the processes behind the observed moder- 
ating effect of mastery goals remain speculative here, and further 
research is needed to disentangle them in more detail. 

Altogether, there seem to be promising avenues by which to 
reduce self-handicapping, but only some of them have already 
been explored. These sporadic endeavors need to be extended to 
standardized intervention programs that are designed to prevent or 
minimize self-handicapping and that can be effectively applied 
across different forms of schooling and cultural contexts. The 
present meta-analysis has set the stage for understanding the 
importance of preventing students from becoming chronic self- 
handicappers. 


Limitations and Suggestions for Further Research 


Some limitations need to be considered regarding the present 
meta-analysis. Unfortunately, because many studies did not include
the variables that we targeted as moderators, not all studies
could be included in the moderator analyses. With 
respect to the selection of moderator variables, future investiga- 
tions could focus on additional moderators such as the degree to 
which a task is challenging or difficult. In addition, it would be 
interesting to analyze the associations of self-handicapping not 
only with academic achievement indicators but also with other 
outcome variables such as interest, academic engagement, or the 
use of specific learning strategies. An important issue for further 
meta-analytic research on the relation between self-handicapping 
and achievement would be to take longitudinal cross-lagged ef- 
fects into account (cf. Huang, 2011). Such examinations could 
shed more light on the question of which variable causes the other. 
Furthermore, it might be interesting to conduct a meta-analysis on 
the antecedents of self-handicapping so researchers can establish a 
rank order of the most relevant risk factors for self-handicapping. 
Finally, the present meta-analysis was based on questionnaire data 
only. Future studies may wish to focus on the large number of 
experimental studies in the field. 

Despite the above-mentioned limitations, we believe that the 
present meta-analysis has provided important insights into the 
effects of self-handicapping on achievement in the academic domain. 
Considering the relevant literature up to August 2013, our 
findings explicated the relative importance of self-handicapping as 
a correlate of academic achievement. The results should also 
caution researchers about which instrument to use to assess self- 
handicapping, as this may have an effect on the validity of the 
findings. Taken together, these findings have utility for practitio- 
ners, researchers, and theorists as they seek to reduce maladaptive 
psycho-behavioral strategies that ultimately limit students’ aca- 
demic potential. 
We examined classrooms as complex systems that affect students’ literacy learning through interacting 
effects of content and amount of time individual students spent in literacy instruction along with the 
global quality of the classroom learning environment. We observed 27 3rd-grade classrooms serving 315 
target students using 2 different observation systems. The first assessed instruction at a more micro level, 
specifically, the amount of time individual students spent in literacy instruction defined by the type of 
instruction, role of the teacher, and content. The second assessed the quality of the classroom learning 
environment at a more macro level, focusing on classroom organization, teacher responsiveness, and 
support for vocabulary and language. Results revealed that both global quality of the classroom learning 
environment and time individual students spent in specific types of literacy instruction covering specific 
content interacted to predict students’ comprehension and vocabulary gains, whereas neither system 
alone did. These findings support a dynamic systems model of how individual children learn in the 
context of classroom literacy instruction and the classroom learning environment, which can help to 
improve observations systems, advance research, elevate teacher evaluation and professional develop- 
ment, and enhance student achievement. 


Keywords: reading, classroom observation, child individual differences, intervention, language, differentiated instruction 


Reading comprehension and vocabulary have been identified as 
strong predictors of future academic success (National Institute of 
Child Health and Human Development [NICHD], 2000) as well as 
of overall school and life outcomes (Beck, McKeown, & Kucan, 
2002). Yet, by the end of fourth grade, only about 34% of U.S. 
students are reading and comprehending proficiently (National 
Center for Education Statistics, 2013). Accumulating research 
points to the importance of classroom literacy instruction and the 
opportunities to learn that students receive in the early grades 
(Connor et al., 2013; NICHD, 2000; Pianta, Belsky, Houts, & 
Morrison, 2007; Snow, 2001; Tuyay, Jennings, & Dixon, 1995). 
Understanding the classroom learning environment is important, and 
finding ways to elucidate the active ingredients of this environment 
that predict student outcomes is essential but challenging. 
In this study, we used a dynamic systems framework (Yoshikawa 
& Hsueh, 2001), which holds that there are multiple 
sources of influence on children’s learning (Bronfenbrenner & 
Morris, 2006), including the instruction they receive, how this 
instruction is delivered (Connor, Piasta, et al., 2009; Reis, Mc- 
Coach, Little, Muller, & Kaniskan, 2011), the general climate of 
the classroom (Rimm-Kaufman, La Paro, Downer, & Pianta, 
2005), teacher characteristics (Raver, Blair, & Li-Grining, 2011), 
and students themselves (Connor & Morrison, 2012; Justice, Pet- 
scher, Schatschneider, & Mashburn, 2011). Further, these sources 
of influence interact in different ways, with some seemingly im- 
portant factors (e.g., teacher education) having relatively small 
effects on students’ reading development (Goldhaber & Anthony, 
2003) and other factors (e.g., content and minutes of instruction) 
having large effects (Connor, Morrison, Schatschneider, et al., 
2011). High-quality literacy instruction should provide students 
with individualized opportunities to learn that, in turn, influence 
their reading comprehension and language development (Beck et 
al., 2002; Beck, Perfetti, & McKeown, 1982; Connor et al., 2013; 
Snow, 2001). Thus, there is an increasing policy and research 
focus on how to measure classroom instruction in ways that validly 
and robustly predict gains in students’ literacy and vocabulary 
skills (see Crawford, Zucker, Williams, Bhavsar, & Landry, in 
press; Kane, Staiger, & McCaffrey, 2012; Ramey & Ramey, 2006; 
Reddy, Fabiano, Dudek, & Hsu, in press; Whitehurst et al., 1988). 
The aim of this study was to systematically investigate the 
classroom-learning environment as a dynamic system, identify 
major dimensions of classroom instruction—at both the individual 
student level and the global classroom level—that may influence 
students’ literacy achievement, and determine how these dimensions 
might work together synergistically to support (or fail to support) 
opportunities for learning that result in gains in third graders’ vocab- 
ulary and reading comprehension. 


Classroom Observation Systems 


Teacher value-added scores have revealed that there is measur- 
able variability in the effectiveness of teaching, which has direct 
implications for students’ success or failure (Konstantopoulos & 
Chung, 2011). However, value-added scores do not reveal what is 
going on in the classroom and the characteristics of the environ- 
ment that explain the variability in teachers’ value-added scores. 
The development of rigorous classroom observation systems that 
are reliable and have good predictive validity is important because 
they help to open up the black box of classroom instruction, 
so to speak, and begin to move us toward what has been described 
as “shared instructional regimes” (Raudenbush, 2009). Rauden- 
bush (2009) described historical and recent theories of teaching as 
“privatized idiosyncratic practice” (p. 172) whereby teachers close 
their classroom doors and teach in the ways they believe to be best 
and where the ideal teacher develops his or her own curriculum. 
The “idiosyncratic” practice of teachers who have a good grasp of 
the current research, who have expert and specialized knowledge 
of their content area, and who understand how to use research 
evidence to inform their practice can be highly effective. However, 
the privatized idiosyncratic practice of some teachers may be 
highly ineffective (Piasta, Connor, Fishman, & Morrison, 2009), 
particularly for children from low-socioeconomic status families 
whose home learning environment and access to resources are 
limited and who are more reliant on the instruction they receive at 
school. Research-based observation tools allow us to illustrate 
what effective expert practice in the classroom actually looks like 
so that it can be shared among a community of professionals— 
both educators and researchers—to improve teaching. 

There are several well-documented observation systems in use 
with new systems being developed (Connor, 2013a). These class- 
room observation systems provide important insights, and most of 
them explain at least modest amounts of the variance in students’ 
literacy learning. For example, Kane and colleagues (2012) tested 
several observation instruments, including the Framework for 
Teaching (Danielson, 2007), CLASS (Pianta, La Paro, & Hamre, 
2008), Protocol for Language Arts Teaching (Grossman et al., 
2010), Mathematical Quality of Instruction (Hill, Ball, Bass, & 
Schilling, 2006), and UTeach Teacher Observation Protocol 
(2009). Results revealed that although none of the systems de- 
signed to assess English/Language Arts instruction correlated with 
teacher value-added scores computed using state-mandated assess- 
ments of English/Language Arts, they were mildly to moderately 
positively correlated with teacher value-added scores computed 
using the SAT-9 reading assessment. 


Classroom Observations in the Present Study 


We used two different observation coding systems to test the 
dynamic systems model of instruction in the present study: the 
quality of the classroom learning environment (CLE) and the 
Individualizing Student Instruction (ISI)/Pathways observation system 
(ISI/Pathways; Connor, Morrison, et al., 2009). The first was 
designed to capture the global quality of the CLE using a rubric 
that captured elements of the CLE that are generally predictive of 
student outcomes. The second, ISI/Pathways, was designed to 
record the amount of time individual students spent in various 
types of literacy instruction, the content of this instruction, the role 
of the teacher, and the context (e.g., whole class, small group) in 
which instruction was provided. We conjectured, following the 
dynamic systems model, that classroom opportunities to learn 
would operate at both student and classroom levels and that the 
two systems together might better elucidate the complexities of the 
classroom and effective learning opportunities afforded to students 
than either system alone. We describe each below. 


Quality of the CLE Rating Scale 


The CLE rating scale (see Appendix A) was designed to rate the 
classroom on three dimensions: Teacher Warmth, Responsiveness 
and Discipline; Classroom Organization; and Teacher Support for 
Vocabulary and Language Development—with one rating for each 
scale for the entire observation of the literacy block. Teacher 
warmth, responsiveness, and discipline were defined as teachers’ 
regard for their students, the overall emotional climate of the 
classroom, as well as the way in which they responded to students, 
particularly with regard to how they responded to student misbe- 
havior and disruptions (Pianta, La Paro, Payne, Cox, & Bradley, 
2002). Examples of teacher warmth and responsiveness include 
being supportive of students, providing positive feedback, clearly 
communicating what is expected of students, and providing disci- 
pline in a positive and supportive way (Rimm-Kaufman et al., 
2005). The kinds of discussions and types of questions used, for 
example, coaching versus telling (Taylor & Pearson, 2002), were 
measured indirectly through this dimension. Research has shown 
that students whose teachers were more warm and responsive 
achieved greater gains in reading skills, including vocabulary, by 
the end of first grade (Connor, Son, Hindman, & Morrison, 2005). 

Classroom organization is defined as the degree to which the 
teacher takes time to give students thorough directions for upcom- 
ing activities, has clear rules for behavior, and has established 
routines that optimize student learning time (Wharton-McDonald, 
Pressley, & Hampston, 1998). When teachers have strong orient- 
ing and organizational skills, they are better able to create an 
efficient and productive CLE. It has been found that teachers who 
implement rules and effectively establish routines are less likely to 
have difficulties with classroom management (Borko & Niles, 
1987; Cameron, Connor, Morrison, & Jewkes, 2008). 

According to Beck and colleagues (2002), teacher support for 
vocabulary and language development should be “robust,” mean- 
ing that instruction should include activities beyond those that 
encourage rote memorization of words and their definitions and, 
instead, involve rich contexts that extend beyond the classroom. 
Such support has the potential to improve language skills overall 
given that processes where vocabulary knowledge is highly used 
(e.g., during reading comprehension) require skills above and 
beyond knowing the definitions of words. Therefore, instructional 
techniques that take into consideration that vocabulary knowledge 
is part of students’ background knowledge (Stahl, 1999), rather 
than a singular component, are more likely to be effective. Such 
techniques encourage students to actively use and think about 
word meanings and create word associations in multiple contexts. 
Accumulating evidence further highlights the importance of sup- 
porting student language development because vocabulary (and 
oral language skills in general) are highly predictive of students’ 
reading comprehension (Biemiller & Boote, 2006; Cain, Oakhill, 
& Lemmon, 2004) and, at the most basic level, allow students to 
understand grade-level texts. Yet, despite the known importance of 
vocabulary and language skills to later reading abilities and 
subsequent academic success, it has been shown that vocabulary 
instruction is often missing from language arts/literacy classrooms 
(see Cassidy & Cassidy, 2005/2006; Rupley, Logan, & Nichols, 
1998). 


The ISI/Pathways Classroom Observation System 
(ISI/Pathways) 


The ISI/Pathways system (Connor, Morrison, et al., 2009) measures 
the amount of time (minutes:seconds) individual students 
within a classroom spend in literacy instruction activities across 
three dimensions: content of instruction; context; and the role of 
the teacher and student in the learning activity. The content of 
instruction dimension (see Table 1 and Appendix B) captures the 
specific topic of the literacy instruction that individual students are 
receiving (e.g., comprehension, vocabulary, text reading). We also 
coded noninstructional activities, which included off-task or 
disruptive behaviors, transitions between activities, or time spent 
when the teacher was giving directions for upcoming activities. 
The context dimension captures the student-grouping arrangement 
and includes whole class, small group, or individual instruction. 
Management captures who is controlling the students’ attention 
during an activity: the teacher and student working together 
(teacher/child-managed), peer-managed (students working with 
each other), or child/self-managed (student managing his or her 
own attention). These dimensions operate simultaneously (see 
Table 1) to describe instructional and noninstructional activities 
observed during reading instruction. For example, the teacher and 
students discussing a book they just read together would be coded 
as a comprehension (meaning-focused), teacher/child-managed, 
whole-class activity that lasted for 11 min. 


Table 1 
Dimensions of Instruction (Context, Grouping, and Management) and Content Areas Associated With Code- and Meaning-Focused Types of Instruction 

Variable: Teacher/child-managed, Whole class 
    Code-focused: The teacher is teaching the class how to decode multisyllabic words by writing them on the white board and then demonstrating various strategies, such as looking for prefixes and suffixes.ᵃ 
    Meaning-focused: The teacher is reading A Single Shard to the class. She stops every so often to ask the students questions. 

Variable: Teacher/child-managed, Small group 
    Code-focused: The teacher is working with a small group of children on spelling (i.e., encoding) strategies. 
    Meaning-focused: The teacher and a small group of students are discussing Mr. Popper’s Penguins and how it is similar and different from Charlotte’s Web. 

Variable: Child/peer-managed, Small group and individual self-managed 
    Code-focused: Students are working together in pairs to complete a worksheet on dividing multisyllabic words into syllables. 
    Meaning-focused: Students are working individually to revise an argumentative essay using feedback from their peers. 

Variable: Content areas (in the coding system) 
    Code-focused: Phonological Awareness; Morphological Awarenessᵃ; Word Identification/Decoding; Word Identification/Encoding; Grapheme/Phoneme correspondence; Fluencyᵃ 
    Meaning-focused: Print and Text Concepts; Oral Language; Print Vocabulary; Listening and Reading Comprehension; Text Reading; Writing 

ᵃ It can be argued that morphological awareness (Carlisle, 2000) and fluency (Therrien, 2004) might also be considered meaning-focused activities. In our 
theory of literacy instruction, code-focused activities represent the more automatic processes, whereas meaning-focused activities require the integration 
of the more automatic processes with active construction of meaning of connected text. Hence, code-focused activities are more likely to directly affect 
aspects of reading related to skill, whereas meaning-focused activities are more likely to directly contribute to aspects of comprehension and reading for 
understanding. 

A key characteristic of ISI/Pathways is that the measurement of 
instructional time and content is assessed for individual students in 
the classroom (Connor, Morrison, et al., 2009). Hence, the system 
is able to capture the learning opportunities afforded to each 
student, for example, recording that Student A was reading with 
the teacher while, at the same time, Student B was off task and not 
redirected. A global classroom-level system would likely capture 
Student A’s instructional opportunities but not Student B’s. The 
more precise measure of each individual student’s learning oppor- 
tunities has been used to identify instructional practices as well as 
Child Characteristic × Instruction interaction effects on students’ 
reading achievement (Connor, Morrison, Fishman, et al., 2011; 
Connor, Morrison, Schatschneider, et al., 2011; Connor, Piasta, et 
al., 2009). Across studies, amounts, content, context, and types of 
instruction measured by the ISI/Pathways observation system pre- 
dicted student literacy achievement (Connor, Morrison, et al., 
2009), particularly the difference between observed and recom- 
mended individualized types/content and amounts of instruction 
(Connor, Piasta, et al., 2009). The closer the observed amount 
matched the recommended amount, the greater the students’ liter- 
acy gains were. 
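
The student-level coding described above can be pictured as one record per coded activity per target student. The following is only a minimal sketch of that idea, not the ISI/Pathways software or its actual data format: the `CodedSegment` layout, its field names, and the `seconds_by_type` helper are hypothetical and introduced here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedSegment:
    """One coded activity for one target student (hypothetical record layout)."""
    student_id: str
    content: str      # e.g., "comprehension", "text reading", "off task"
    focus: str        # "code", "meaning", or "noninstruction"
    context: str      # "whole class", "small group", or "individual"
    management: str   # "teacher/child", "peer", or "child"
    seconds: int      # duration of the segment


def seconds_by_type(segments):
    """Total seconds per student for each (management, context, focus) type."""
    totals = defaultdict(int)
    for seg in segments:
        key = (seg.student_id, seg.management, seg.context, seg.focus)
        totals[key] += seg.seconds
    return dict(totals)


# Student A reads with the teacher while Student B is off task (illustrative values).
segments = [
    CodedSegment("A", "text reading", "meaning", "small group", "teacher/child", 660),
    CodedSegment("B", "off task", "noninstruction", "individual", "child", 660),
]
totals = seconds_by_type(segments)
```

Because each record belongs to a single student, Student B's off-task time stays visible in the totals instead of being folded into a single classroom-level rating.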

We posed the following research question: 

How does combining measures of the duration of different types 
of literacy instruction and content for individual students with a 
more global measure of the CLE synergistically affect students’ 
reading comprehension and vocabulary outcomes? Using our dy- 
namic systems model of the classroom, we hypothesized that 
neither the quality of the CLE nor the amount of time individual 
students spent in different types/content of literacy instruction 
(ISI/Pathways) would be strong independent predictors of vocab- 
ulary and comprehension outcomes for third-grade students. 
Rather, we hypothesized that there would be interaction effects 
involving both systems that would significantly and positively 
predict third graders’ language and comprehension gains. Such 
interaction effects would, hypothetically, better capture the com- 
plexity of classroom instruction and the learning environment. 


Method 


Participants 


The participants included third-grade teachers (n = 27, 13 in the 
individualized reading group and 14 in the vocabulary control 
group) and their students (n = 315) in seven schools who were 
participating in a randomized controlled study evaluating the ef- 
ficacy of individualized reading instruction from first through third 
grade (Connor, Morrison, Fishman, et al., 2011). We selected third 
grade because comprehension of text becomes increasingly impor- 
tant (Gottardo, Stanovich, & Siegel, 1996; Reynolds, Magnuson, 
& Ou, 2010) as children move from learning to read to reading to 
learn (Chall, 1967). The schools were located in an economically 
and ethnically diverse school system in north Florida. 

Teachers. All teachers completed the study with the exception 
of three teachers who were not present during the last month of the 
study; however, results for these teachers and their students were 
used in the analysis because observations were completed before 
they left, and all of their participating students were assessed. All 
of the teachers met the state certification requirements and had at 
least a bachelor’s degree related to an educational field. Seven of 
the teachers had certifications or degrees beyond a bachelor’s 
degree. Teachers’ classroom teaching experience ranged from 0 to 
30 years, with a mean of 10.9 years of experience. 

Teachers in both conditions participated in half-day workshops 
in the fall and again in January for either literacy or vocabulary. 
They also participated in monthly meetings. Teachers in the read- 
ing intervention also received biweekly classroom-based support 
(not on the day observed). In total, teachers in the vocabulary 
condition received about 12 hr of professional development, and 
teachers in the individualized reading intervention received be- 
tween 18 and 20 hr (some needed more help with the technology). 
Professional development for the individualized reading interven- 
tion helped teachers learn how to individualize student instruction 
and how to be better organized. Professional development for the 
vocabulary intervention focused on vocabulary teaching methods 
described in Beck et al. (2002). Results of the larger randomized 
controlled study revealed that students whose teachers were in the 
individualized reading intervention group demonstrated significantly 
greater gains in reading comprehension than did the students whose 
teachers were in the vocabulary intervention group. Results are 
fully described in Connor et al. (2011). 

Students. Schoolwide percentages of students qualifying for 
free and reduced lunch (FARL) programs ranged from 92% (high 
poverty) to 4% (affluent). All schools used the Open Court Read- 
ing Curriculum and had a 90-min uninterrupted block of time 
devoted to reading instruction. All observations were conducted 
during this literacy block. 

Student participant demographics were collected through parent 
reports and school records and were as follows: 36% of the 
students were White, 51% were African American/Black, 3% were 
Hispanic, 3% were Asian/Asian American, 3% were multiracial, 
and the remaining 4% indicated other ethnic groups. Forty-seven 
percent qualified for FARL. Approximately 12% qualified for 
special education services. A subset of students from each class- 
room was randomly selected to be the focus of observation coding 
using the following procedure: Because this was a first- through 
third-grade longitudinal study (Connor et al., 2013), third graders 
who were in the first- and second-grade studies were automatically 
selected as target students. We then randomly selected from among 
their classmates to bring the total number of target students to a 
minimum of eight per classroom after rank ordering and randomly 
selecting within terciles so that we had a distribution of reading 
skill level. This provided the final sample of 315 students. On 
average, there were 11 target students per classroom, and this 
ranged from six to 19. Three classrooms had six target students; 
two had seven students; all others had eight or more target stu- 
dents. Comparisons of this subsample with the entire sample 
revealed no significant differences on any of the measures of 
interest. 
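
The selection procedure described above (returning students kept automatically, then classmates drawn at random within reading-skill terciles until a minimum of eight is reached) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function name `select_targets`, the round-robin order of draws across terciles, and the fixed seed are all hypothetical choices.

```python
import random


def select_targets(students, returning_ids, minimum=8, seed=0):
    """Pick target students for one classroom.

    students: list of (student_id, reading_score) tuples.
    returning_ids: ids of students from the earlier grades, always selected.
    """
    rng = random.Random(seed)
    selected = [sid for sid, _ in students if sid in returning_ids]

    # Rank order the remaining classmates by score and cut into terciles.
    pool = sorted((s for s in students if s[0] not in returning_ids),
                  key=lambda s: s[1])
    n = len(pool)
    terciles = [pool[:n // 3], pool[n // 3:2 * n // 3], pool[2 * n // 3:]]
    for tercile in terciles:
        rng.shuffle(tercile)

    # Assumption: draw from the terciles in turn so skill levels stay balanced.
    i = 0
    while len(selected) < minimum and any(terciles):
        tercile = terciles[i % 3]
        if tercile:
            selected.append(tercile.pop()[0])
        i += 1
    return selected
```

A classroom that already has eight or more returning students gets no additional draws, which is consistent with classrooms ending up with more than eight target students.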


Observation of Instruction and CLE 


Again, in this study, we used two different observation sys- 
tems—the ISI/Pathways system, which measured the amounts and 
types/content of instruction students received, and the CLE quality 
rubric, which captured the quality of the classroom learning environment. 
Both systems used the same videotaped classroom observations. 
Classrooms were videotaped three times, once during 
the fall, once in winter, and once in the early spring of the 
academic year. This video footage of the 90-min block of time 
devoted to literacy instruction was captured using two digital 
camcorders with wide-angle lenses. Cameras were not focused on 
individual students per se. Rather, cameras were positioned at 
opposite sides of the classroom to capture as much of the class as 
possible. However, during small group instruction, one camera 
was focused on the teacher’s small group, whereas the second 
camera captured students working independently and the other 
small groups. While video recording, trained research assistants 
kept detailed field notes regarding the activities and materials used, 
including careful descriptions of target students and activities of 
students who might be off camera (Bogdan & Biklen, 1998). These 
notes were used in conjunction with video footage during coding 
and provided information for coders about students or activities 
that could not easily be seen on the videos. Observations were 
scheduled at the teachers’ convenience, so the assumption was that 
the instruction was of the highest quality the teacher could provide. 
Quality of the CLE. The quality of the CLE was assessed 
using a detailed rubric/rating scale (see Appendix A). The rating 
scale ranged from 1 (low) to 6 (high) and examined three global 
classroom-level dimensions (organization, support for vocabulary 
and language, and teacher responsiveness), with priority 
given to specific aspects of the CLE that were the focus of the 
professional development provided and that previous research has 
associated with more effective instruction (i.e., higher quality; 
Brophy, 1979; Cameron, Connor, & Morrison, 2005; Pianta et al., 
2002; Snow, Burns, & Griffin, 1998; Taylor, Pearson, Clark, & 
Walpole, 2000; Wharton-McDonald et al., 1998). Highly trained research 
assistants who were blind to the teachers’ treatment assignment coded 
all three videotaped observations using the CLE rubric. Sufficient 
interrater reliability on fall observations, based on Landis and 
Koch’s (1977) criteria, was reached prior to coding the winter 
observation (Cohen’s κ = 0.73). Approximately 10% of coded 
winter and spring footage was randomly selected and recoded, and 
interrater reliability was maintained (Cohen’s κ = 0.73). The 
winter observation CLE score was used in this study because 
teaching tends to be more consistent during the winter months 
(Hamre, Pianta, Downer, & Mashburn, 2007) than in the earlier 
months, when routines are getting established and teachers are just 
getting to know their students. Two teachers’ scores were based on 
the spring observation because a student teacher, and not the 
primary teacher, was teaching during the winter observation. Be- 
cause scores on the three scales were moderately correlated (r = 
.59-.60) and combining the three scores improved internal reli- 
ability, the scores were summed to provide a total CLE score. 
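
For readers unfamiliar with the reliability index reported above, the following is a minimal sketch of unweighted Cohen's kappa for two raters, together with the descriptive bands from Landis and Koch (1977). It assumes the unweighted form; the source does not state whether a weighted variant was used, and the function names are illustrative.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa):
    """Descriptive strength-of-agreement bands from Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

Under these bands, the value of 0.73 reported for the CLE rubric falls in the "substantial" range.
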
ISI/Pathways observation system. As noted previously, the 
ISI/Pathways system was designed to document instruction across 
three dimensions: (a) the content of the literacy instruction (e.g., 
comprehension, oral language); (b) the context (i.e., small group, 
individual, or whole class); (c) the extent to which the teacher was 
interacting with students (management), which included teacher/ 
child-managed instruction (teacher and students working together) 
or child-managed instruction (students working with each other or 
independently) (Connor, Morrison, et al., 2009). Using Noldus 
Observer Video-Pro software (XT version 8.0; Noldus, Trienes, 
Hendriksen, Jansen, & Jansen, 2000), instructional activities that 
lasted 15 s or longer during the 90-min literacy block 
were coded for each target student. Noninstructional practices 
(including transitions) were also coded. An excerpt from the 
ISI/Pathways Coding Manual (Connor, Morrison, et al., 2009) is 
presented in Appendix B. 

Trained research assistants coded each of the videos using the 
Noldus Observer XT software. The training process was extensive, 
typically lasting 4 to 6 weeks until coders achieved adequate interrater 
reliability (Cohen’s κ > 0.7) with a master coder. Questions 
about coding were discussed at biweekly coding meetings until 
consensus was achieved. Random selection and analysis of ap- 
proximately 10% of the videos revealed good ongoing interrater 
reliability among the coders (mean Cohen’s κ = 0.72; Landis & 
Koch, 1977). The lengths of videos varied slightly depending on 
the season. Mean observation length was 85 min in the fall (SD = 30), 
73 min in the winter (SD = 35), and 79 min in the spring (SD = 
2). 

The observation system provided a detailed description of over 
200 instructional variables. For this study, we separated type of 
instruction into code-focused and meaning-focused instruction 
(see Table 1). Code-focused instruction consisted of five distinct 
types: phonological awareness, morpheme awareness, word de- 
coding, word encoding, and fluency. Meaning-focused instruction 
consisted of six distinct types: print and text concepts, oral lan- 
guage and oral vocabulary, print vocabulary, listening and reading 
comprehension, text reading, and writing. These distinctions were 
made using our theoretical framework that the largely unconscious 
and more automatic processes involved in reading, including sub- 
sentence processes (e.g., morphological awareness), and those that 
supported fluency (i.e., automaticity) were considered code- 
focused, whereas more reflective and text-level processes were 
meaning-focused (Connor, 2013b). The case can be made that 
morphological awareness should be considered a meaning-focused 
activity (Carlisle, 2000). Unfortunately, too little morphological 
awareness instruction was observed to test this alternative. For 
each student, the time (in seconds) spent in particular types of 
instruction (e.g., teacher/child-managed small group meaning-focused) 
was computed for fall, winter, and spring and then 
aggregated by taking the mean amount for each student. 
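
The aggregation step just described, computing seconds per instruction type for each observation and then averaging across fall, winter, and spring, could look like the sketch below. The record layout and the function name are hypothetical, and treating a type that was not observed in a given season as 0 s is an assumption that the source does not spell out.

```python
from collections import defaultdict
from statistics import mean


def mean_seconds_across_observations(records):
    """Average seconds per (student, instruction type) over all observed seasons.

    records: list of (student_id, season, instruction_type, seconds) tuples.
    """
    seasons = sorted({season for _, season, _, _ in records})
    by_key = defaultdict(dict)
    for student_id, season, instruction_type, seconds in records:
        per_season = by_key[(student_id, instruction_type)]
        per_season[season] = per_season.get(season, 0) + seconds
    # Assumption: a type never observed in a season counts as 0 seconds.
    return {
        key: mean(per_season.get(s, 0) for s in seasons)
        for key, per_season in by_key.items()
    }


records = [
    ("A", "fall", "TCM small group meaning-focused", 600),
    ("A", "winter", "TCM small group meaning-focused", 300),
    ("A", "spring", "TCM small group meaning-focused", 300),
    ("B", "fall", "TCM small group meaning-focused", 90),
]
means = mean_seconds_across_observations(records)
```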


Student Assessments 


Trained research assistants assessed students’ language and lit- 
eracy skills in the fall and again in the spring. Two measures of 
comprehension and two of vocabulary were used in this study. 
Comprehension was assessed using the Woodcock-Johnson-III 
Passage Comprehension subtest (WJ-III; Woodcock, McGrew, & 
Mather, 2001) and Level 3 Gates-MacGinitie Reading Tests 
(GMRTs; MacGinitie & MacGinitie, 2006) Reading Comprehen- 
sion subtest. Vocabulary was assessed using the WJ-III Picture 
Vocabulary subtest and the GMRT Reading Vocabulary subtest. 
Alternate versions of the assessments were administered in the fall 
and spring. Scores were provided to teachers and parents. 

The WJ-III Passage Comprehension uses a cloze procedure in 
which students read a sentence or short passage and supply the 
missing word. The Picture Vocabulary task asks students to name 
increasingly unfamiliar pictures. These subtests are administered 
individually and have demonstrated alpha reliability estimates of 
.88 for passage comprehension and .81 for picture vocabulary. W 
scores were used in the analyses because they are on an equal-interval 
scale, similar to the Rasch score (Rasch, 2001). 

In the GMRT Reading Comprehension assessment, students 
read passages of varying length and complexity based on excerpts 
from narrative and expository texts commonly used in schools. 
Students are then asked to answer multiple-choice questions, in- 
cluding some that require fairly high-level inferencing. As the 
student progresses through the test, the text becomes more difficult 
and the questions demand more inferencing. The GMRT multiple- 
choice vocabulary assessment requires students to read and then 
choose the correct meaning of an underlined word within a short 
phrase from four possible answers. Alternate form reliability for 
the GMRT has been reported to range from .74 to .92, and 
test-retest reliability has been reported as ranging from .88 to .92. 
Extended scale scores, which have an equal-interval scale, were 
used for data analyses. 


Analytic Strategies 


Structural equation modeling. We used structural equation 
modeling to examine the constructs of language and comprehen- 
sion, and we hypothesized that the language and literacy measures 
would comprise one latent variable (e.g., Mehta, Foorman, 
Branum-Martin, & Taylor, 2005), keeping in mind that the GMRT 
vocabulary assessment required the students to read. As antici- 
pated, the four measures were correlated (see Table 2). We used 
confirmatory factor analyses (AMOS version 21) to compare one- 
and two-factor models (Kline, 1998). We provide results for the 
spring models in detail. Results were highly similar for the fall 
models. The two-factor model (comprehension and vocabulary) 
provided only moderately acceptable fit (Tucker-Lewis Index 
[TLI] = .939; comparative fit index [CFI] = .994; root-mean-square 
error of approximation [RMSEA] = .110; Akaike’s information 
criterion [AIC] = 32.401), whereas the one-factor model 
provided a superior fit (TLI = .969; CFI = .994; RMSEA = .079; 
AIC = 31.525). We created fall and spring factor scores using 
principal component factor analysis, which explained 72.8% of the 
variance in the fall scores and 71.3% of the variance in the spring 
scores. Loadings are provided in Table 2. The factor variable, 
Vocabulary/Comprehension (VocComp; z score with a mean of 0 
and an SD of 1), was used in the analyses. Such factor or latent 
variable scores have the advantage of reducing measurement error 
and better capturing the complex construct of interest (Keenan, 
Betjemann, & Olson, 2008; Kline, 1998; Mehta et al., 2005; 
Vellutino, Tunmer, Jaccard, Chen, & Scanlon, 2004). 
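
As a rough illustration of how a single standardized factor score can be derived from four correlated measures, the sketch below extracts the first principal component of the correlation matrix and rescales the composite to a z score. It is not the statistical-package procedure used in the study: the power-iteration routine, the function names, and the simulated scores are all assumptions made for illustration.

```python
import math
import random
import statistics


def standardize(values):
    """Rescale raw scores to z scores (mean 0, SD 1)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def first_component(z_columns, iterations=500):
    """Weights and variance share of the first principal component.

    Uses power iteration on the correlation matrix (an illustrative
    stand-in for a statistics package).
    """
    k, n = len(z_columns), len(z_columns[0])
    corr = [[sum(a * b for a, b in zip(z_columns[i], z_columns[j])) / n
             for j in range(k)] for i in range(k)]
    weights = [1.0] * k
    for _ in range(iterations):
        weights = [sum(corr[i][j] * weights[j] for j in range(k))
                   for i in range(k)]
        norm = math.sqrt(sum(w * w for w in weights))
        weights = [w / norm for w in weights]
    eigenvalue = sum(weights[i] * sum(corr[i][j] * weights[j] for j in range(k))
                     for i in range(k))
    return weights, eigenvalue / k


def factor_scores(columns):
    """Return (z-scored composite, share of variance explained)."""
    z_columns = [standardize(col) for col in columns]
    weights, share = first_component(z_columns)
    composite = [sum(w * col[s] for w, col in zip(weights, z_columns))
                 for s in range(len(z_columns[0]))]
    return standardize(composite), share


# Simulated (hypothetical) data: four correlated measures for 300 students.
random.seed(1)
ability = [random.gauss(0, 1) for _ in range(300)]
measures = [[a + random.gauss(0, 0.6) for a in ability] for _ in range(4)]
voc_comp, variance_share = factor_scores(measures)
```

Here `variance_share` plays the role of the percentage of variance explained by the single component, and `voc_comp` is the composite on a z-score metric (mean 0, SD 1).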

Hierarchical linear modeling (HLM). Because the data were nested, 
with students nested within classrooms and schools, HLM 
(Raudenbush & Bryk, 2002) was used to answer our 
research question. HLM analyses accounted for shared variance 
within classrooms and schools, resulting in more accurate effect 
sizes and noninflated standard errors (Raudenbush & Bryk, 2002). 
Model specification occurred in several steps. Initially, an uncon- 
ditional model was created. Variance at the classroom level was 
divided by total variance (the summation of student- and 
classroom-level variance) to obtain the intraclass correlations 
(ICCs). ICC values represent the proportion of variance falling 
between classrooms. A three-level model with students nested 
within classrooms and classrooms nested within schools was cre- 
ated. This model indicated no significant between-school variance. 
Therefore, a simpler two-level model with the spring VocComp 
factor score as the outcome (Yij) was created. Starting with the 
unconditional model, we first created a model with only the 
ISI/Pathways variables entered at the student level. Next, we 
created a model with only CLE quality entered at the classroom 
level; we then created a combined model and tested for cross-level 
interaction effects (ISI/Pathways [student] × CLE Quality [classroom]). 
We also tested for students’ fall VocComp Score × Instruction 
interactions (models are available upon request). 
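
The ICC calculation described above (classroom-level variance divided by the sum of student- and classroom-level variance) can be sketched as follows. The function name and the variance components are hypothetical, chosen only so that about a quarter of the variance lies between classrooms.

```python
def intraclass_correlation(classroom_variance, student_variance):
    """Proportion of total variance that lies between classrooms."""
    total = classroom_variance + student_variance
    if total <= 0:
        raise ValueError("variance components must sum to a positive value")
    return classroom_variance / total


# Hypothetical variance components from an unconditional two-level model.
icc = intraclass_correlation(classroom_variance=0.24, student_variance=0.76)
print(f"ICC = {icc:.2f}")
```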


Results 


Students in this sample entered third grade with vocabulary 
skills generally in line with grade-level expectations based on 
standard scores (M = 100, SD = 15) and percentile ranks (M = 
50), which control for age. For example, their fall mean WJ Picture 
Vocabulary score was 98.92 and their spring mean score was 99.21 


Table 2

Correlations, Means, Standard Deviations, Standard Scores, Percentile Rank, and Factor Loadings of the Reading Comprehension 
and Vocabulary Measures Administered During the Fall and Spring of the Academic Year 

Variable           1            2            3            4            5            6            7            8
1. Fall WJ-PC      —
2. Fall WJ-PV      [illegible]  —
3. Fall GM-C       [illegible]  [illegible]  —
4. Fall GM-V       [illegible]  .669**       .749**       —
5. Spring WJ-PC    .697**       [illegible]  .602**       .663**       —
6. Spring WJ-PV    .503**       .821**       .451**       .633**       [illegible]  —
7. Spring GM-C     [illegible]  .534**       .780**       .740**       .624**       .504**       —
8. Spring GM-V     [illegible]  .674**       [illegible]  .847**       [illegible]  .610**       .760**       —
M                  94.75ᵃ       98.92ᵃ       51.81ᵇ       55.04ᵇ       95.96ᵃ       99.21ᵃ       52.01ᵇ       59.84ᵇ
SD                 10.31ᵃ       10.82ᵃ       28.04ᵇ       27.26ᵇ       10.43ᵃ       10.79ᵃ       29.79ᵇ       25.52ᵇ
Maximum            124ᵃ         [illegible]  99ᵇ          [illegible]  [illegible]  150ᵃ         [illegible]  [illegible]
Minimum            48ᵃ          68ᵃ          [illegible]  [illegible]  54ᵃ          69ᵃ          [illegible]  [illegible]
Factor loadings    .85          .798         .845         [illegible]  .835         .861         .768         .908

Note. WJ = Woodcock-Johnson; PC = passage comprehension; PV = picture vocabulary; GM = Gates–MacGinitie; C = comprehension; V = vocabulary. 
ᵃ Standard scores. ᵇ Percentile rank. 
(see Table 2), suggesting grade-/age-appropriate gains from fall to 
spring, on average. On the basis of GMRT results, students were 
achieving slightly above grade expectations for both reading com- 
prehension and reading vocabulary, with percentiles in the fall of 
51.81 and 55.04, respectively, and spring scores of 52.07 and 
59.84, respectively. Only on the WJ Passage Comprehension mea- 
sure did students generally score below test expectations, with 
standard scores of 94.75 in the fall and 95.96 in the spring. 
Notably, students’ standard scores and percentile ranks were the 
same or higher in the spring compared with the fall, suggesting 
generally grade-appropriate gains in reading and vocabulary. 

Overall, we observed high-quality but variable CLE in these 
third-grade classrooms, with total scores ranging from 5 to 17, 
where a perfect score was 18 (M = 12.58, SD = 3.14). When 
considering the individual scales, teachers generally scored highest 
on the Orienting/Planning scale (M = 4.73, SD = 1.09) and lowest 
on the Support for Language scale (M = 3.66, SD = 1.21). They 
were rated a mean of 4.30 (SD = 1.34) on the Responsiveness/ 
Discipline scale. 

An examination of the kinds of literacy activities that occurred 
during reading comprehension and vocabulary instruction using 
the ISI/Pathways observation system revealed variability in overall 
amounts and types of instruction (see Table 3, Figure 1, and 
Appendix C). Beginning first with oral language and print vocab- 
ulary instruction, on average, students received about 2 min per 
day of small group teacher/child-managed instruction and about 3 
min per day of small group child-managed instruction. Of this 
time, most was spent using vocabulary (Vocabulary Use, see 
Appendix B) and defining words with many of the child-managed 
activities using workbooks. See Figure 1 for graphs showing the 


Table 3 

Descriptive Statistics for Amount (Minutes) of Student 
Instruction Variables and Cross-Level Associations for Amount 
and Classroom Learning Environment (CLE) Ratings (From 1 
[Low Quality] to 6 [High Quality]) 

Variable      N     M       SD      Minimum   Maximum
TCM-SG-CF     315   0.82    1.68    0.00      15.05
TCM-SG-MF     315   9.32    9.56    0.00      42.11
CM-SG-MF      315   4.84    6.13    0.00      35.84
CM-SG-CF      315   0.54    1.31    0.00      6.86
CM-WC-CF      315   0.81    1.91    0.00      17.78
CM-WC-MF      315   6.31    6.98    0.00      29.30
TCM-WC-CF     315   2.80    5.18    0.00      31.05
TCM-WC-MF     315   12.61   16.15   0.00      67.81

Student-level outcome   Classroom-level CLE quality HLM coefficient (SE)
TCM-SG-MF               .261 (1.67)
TCM-WC-MF               −.386 (1.20)
TCM-SM-CF               .060 (.04)
TCM-WC-CF               [illegible]
CM-SM-MF                .117 (.44)

Note. Hierarchical linear modeling (HLM) cross-level student and 
classroom associations are unstandardized coefficients. TCM = 
teacher/child-managed; SG = small group; CF = code-focused 
instruction; MF = meaning-focused instruction; CM = child-managed; 
WC = whole class. 


amounts per day (seconds) of the various types of vocabulary 
activities observed. Examining teacher/child-managed small group 
instruction in listening and reading comprehension revealed that 
students spent approximately 4 min per day in these activities, with 
most of the time spent with the teacher asking questions and 
students responding (2.2 min). Generally, more than 7 min per day 
were spent in whole-class teacher/child-managed vocabulary and 
reading comprehension activities, with most of that time spent in 
question-and-answer time (3.1 min). In all cases, the ranges were 
large, with some students receiving little vocabulary and compre- 
hension instruction and some receiving much more. 

Using these data, we created eight different variables of amounts 
and types/content of instruction (see Tables 1 and 3) that capture the 
entire duration of meaningful instruction that occurred during the 
literacy block: teacher/child-managed whole-class meaning-focused 
instruction; teacher/child-managed small group meaning-focused in- 
struction; teacher/child-managed whole-class code-focused instruc- 
tion; teacher/child-managed small group code-focused instruction; 
child/peer-managed whole-class meaning-focused; child/peer-managed 
small group meaning-focused; child/peer-managed whole-class code- 
focused; and child/peer-managed small group code-focused in- 
struction. To create these variables, the data were first exported 
from Observer Pro and then cleaned and examined in SPSS (Ver- 
sion 20). Because there were multiple tapes that were coded for 
one observation, these were aggregated for each student by sum- 
ming the seconds for each activity within a type of instruction 
(e.g., comprehension, print vocabulary), providing a total amount 
of instruction for each student. 
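
The aggregation steps (summing coded seconds across tapes within an observation, then averaging the fall, winter, and spring totals as described in the Method section) might look like the following minimal sketch. The record layout, student identifier, and durations are hypothetical, and treating a season with no coded time as 0 seconds is an assumption rather than something the text states.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical coded records: (student, season, tape, instruction type, seconds).
records = [
    ("s01", "fall", "tape1", "TCM-SG-MF", 300),
    ("s01", "fall", "tape2", "TCM-SG-MF", 240),
    ("s01", "winter", "tape1", "TCM-SG-MF", 600),
    ("s01", "spring", "tape1", "TCM-SG-MF", 660),
    ("s01", "fall", "tape1", "CM-WC-CF", 60),
]
SEASONS = ("fall", "winter", "spring")


def total_seconds(rows):
    """Sum seconds across tapes for each student, season, and instruction type."""
    totals = defaultdict(int)
    for student, season, _tape, kind, seconds in rows:
        totals[(student, season, kind)] += seconds
    return totals


def mean_minutes(totals):
    """Average the seasonal totals per student and type, converted to minutes.

    Assumption: a season with no coded time for a type counts as 0 seconds.
    """
    pairs = {(student, kind) for student, _season, kind in totals}
    return {
        (student, kind): fmean(
            [totals.get((student, season, kind), 0) for season in SEASONS]
        ) / 60
        for student, kind in pairs
    }


minutes = mean_minutes(total_seconds(records))
```

For the sample records, the two fall tapes are summed to 540 s before the three seasonal totals (540, 600, and 660 s) are averaged and converted to minutes.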

The amounts of each type of instruction were then combined 
on the basis of our theory of reading instruction (Connor, 
Morrison, & Katch, 2004; Connor, Morrison, & Petrella, 2004). 
Meaning-focused instruction was composed of all instructional 
activities that might be expected to explicitly support students’ 
language and comprehension skill gains, including text reading, 
writing, oral language, listening and reading comprehension, 
vocabulary, and other meaning-focused types of instruction (see 
Table 1). There were also activities that might be expected to 
support language and comprehension more implicitly through 
instruction in how to decode unfamiliar words and building 
automaticity, which might be considered more code-focused 
types of instruction. Descriptive statistics for each type of 
instruction are provided in Table 3. Of note, most of the time 
was spent in teacher/child-managed whole-class meaning- 
focused instruction (12.6 min per day), followed by teacher/ 
child-managed small group meaning-focused (9.32 min per 
day). Children spent about 11 min per day working with peers 
or individually on meaning-focused activities. Less time was 
spent in code-focused activities, about 6 min per day compared 
with a total of about 19 min in meaning-focused activities. 

How does combining measures of the duration of different types 
of instruction and content for individual students with a more 
global measure of the CLE synergistically affect students’ reading 
comprehension and vocabulary outcomes? 

As noted in the Method section, we ran three different models, 
with the spring VocComp score as the outcome and controlling for 
fall VocComp scores. The ICC for the unconditional model, which 
is the proportion of variance lying between classrooms, was .24, 
indicating that approximately 24% of the differences in scores among 
students were related to the classroom to which they were assigned. 
Figure 1. 


Main Effects of CLE and Amounts, Types, and 
Content of Instruction on Students’ Vocabulary and 
Reading Comprehension Outcomes 


We then examined the effect of the CLE score on students’ 
spring VocComp score. There was a trend that CLE predicted 
students’ spring outcomes (p = .064). A 1-point increase in CLE 
quality score was associated with about a .02 increase in the 
VocComp spring score (d = .044, which is negligible). The model 
explained about 81% of the variability in children’s scores com- 
pared with the unconditional model. 

Next, we removed the CLE quality score from the model and 
added all eight of the student amount/type/content of instruction 
variables. Results revealed that none of the instruction duration 
variables predicted students’ scores. The model explained 
approximately 80% of the variance in student scores compared with the 
unconditional model. 


CLE Quality × Duration/Type/Content Interactions 


We then added the CLE quality score to the model and tested all 
of the CLE Quality × Time/Type/Content of Instruction interactions 
(see Table 4). We trimmed three variables (i.e., teacher/child-managed 
code-focused small group and whole-class instruction, 
and child-managed small group code-focused instruction) and the 
interactions that did not significantly predict student outcomes. We 
also tested for Student Fall VocComp × Instruction interaction 
effects. None significantly predicted spring VocComp outcomes 
(e.g., Fall VocComp × CLE Quality interaction effect coefficient = 
−.008, p = .400) and so were trimmed from the model. 
This model explained about 81.3% of the variability in students’ 
scores. 
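
The text does not spell out how the percentage of variability explained was computed. One common approach, sketched here as an assumption, is the proportional reduction in total variance relative to the unconditional model. The variance components below are the final-model values reported in Table 4; the unconditional total of 1.0 is itself an assumption (the outcome is a z score), because the unconditional components are not reported.

```python
def proportion_variance_explained(unconditional_total, conditional_total):
    """Proportional reduction in total variance relative to the unconditional model."""
    if unconditional_total <= 0:
        raise ValueError("unconditional variance must be positive")
    return 1 - conditional_total / unconditional_total


# Variance components of the final model as reported in Table 4.
classroom_variance = 0.02124
student_variance = 0.17100

# Assumption: total unconditional variance is about 1.0 (z-score outcome).
explained = proportion_variance_explained(
    1.0, classroom_variance + student_variance
)
print(f"{explained:.3f}")
```

Under these assumptions the result is roughly .81, in the neighborhood of the reported 81.3%; it would differ if the unconditional variance were not exactly 1.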

Figure 1. Amounts (s/day) of teacher/child-managed (TCM) small group vocabulary instruction by type. 


Our models revealed a number of global CLE Quality × Individual 
Student-Level Instruction Amount/Content interactions that 
significantly predicted students’ outcomes (see Table 4 and Figure 
2) regardless of students’ fall VocComp z score. Specifically, for 
teacher/child-managed small group and whole-class meaning- 
focused instruction, when provided by teachers who were judged 
to be providing a higher quality CLE, the effect for students who 
spent more time in meaning-focused instruction was much greater, 
minute for minute, than the same amount of instruction provided 
by teachers who were judged to be providing a CLE of lesser 
quality. The effect was substantial. For example, a student who 
received 18 min of teacher/child-managed small group meaning- 
focused instruction and whose teachers received a CLE score of 17 
(75th percentile of the sample) would achieve scores that were .43 
z-score points (M = 0, SD = 1) higher than a student who received 
the same amount of instruction but whose teachers received a 
rating of 13 (25th percentile of the sample), with an effect size (d) 
of .43. The difference for teacher/child-managed whole-class 
meaning-focused instruction was smaller, .19 z-score points (d = 
.19), or about half of the small group effect size. 
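
To make the arithmetic behind such comparisons concrete, the sketch below computes the difference in predicted scores between two CLE ratings at a fixed amount of instruction, for a model with a CLE main effect and a CLE × minutes interaction. It uses the rounded Table 4 coefficients and uncentered minutes, both of which are simplifying assumptions, so it lands near but not exactly on the reported .43; the function name is hypothetical.

```python
def predicted_difference(cle_high, cle_low, minutes, gamma_cle, gamma_interaction):
    """Difference in predicted outcome between two CLE scores at a fixed
    amount of instruction, given a CLE main effect and a CLE x minutes term."""
    return (cle_high - cle_low) * (gamma_cle + gamma_interaction * minutes)


# Rounded coefficients from Table 4: CLE main effect (0.027) and the
# CLE x TCM-SG-MF interaction (0.005). Minutes are not centered here.
diff = predicted_difference(
    cle_high=17, cle_low=13, minutes=18,
    gamma_cle=0.027, gamma_interaction=0.005,
)
print(round(diff, 3))
```

Because the CLE difference multiplies the interaction term, the gap between higher and lower quality classrooms widens with each additional minute of instruction, which is the pattern the text describes.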


Discussion 


Overall, the teachers in this study provided generally high- 
quality CLEs, but there was substantial variability with two teach- 
ers providing CLEs that were judged to be very low—a 6 or worse 
(out of 18); they received no more than a 1, 2, or 3 on all three 
scales. At the same time, six teachers had almost perfect scores of 
15, having received 5s and 6s on all three scales. There was similar 
variability in the amount of time third graders spent in teacher/ 
child-managed meaning-focused instruction both within and be- 
tween classrooms. Neither CLE quality nor the amount/content/ 
type of instruction (ISI/Pathways) individual students received 
independently predicted students’ vocabulary and comprehension 
gains. Instead, the two systems synergistically captured the com- 
plexity of classroom instruction at the individual student level and 
the more global classroom level. Teachers judged to provide a 
high-quality CLE but whose students received very little teacher/ 
Table 4 

HLM Results: Effect of Classroom Learning Environment (CLE), Amounts/Types/Content of Student Instruction (Minutes), and 
Interaction Effects on Students’ Spring Vocabulary and Comprehension (VocComp) z Scores and HLM Model 

Fixed effect                        Coefficient   SE      t             Approx. df   p
Mean VocComp, γ00                   −0.034        0.032   −1.056        25           .301
CLE, γ01                            0.027         0.008   3.254         25           .003
For TCM-SG-MF slope, β1
  TCM-SG-MF effect, γ10             −0.003        0.003   [illegible]   331          .453
  CLE interaction effect, γ11       0.005         0.001   3.509         331          <.001
For CM-SG-MF slope, β2
  CM-SG-MF effect, γ20              0.005         0.005   0.971         331          [illegible]
  CLE interaction effect, γ21       −0.003        0.001   [illegible]   331          .062
For CM-WC-CF slope, β3
  CM-WC-CF effect, γ30              0.013         0.012   1.102         331          [illegible]
  CLE interaction effect, γ31       −0.004        0.002   −1.897        331          .059
For CM-WC-MF slope, β4
  CM-WC-MF effect, γ40              0.001         0.004   0.314         331          .754
  CLE interaction effect, γ41       0.002         0.001   1.809         331          .071
For TCM-WC-MF slope, β5
  TCM-WC-MF effect, γ50             0.0001        0.001   0.156         331          .876
  CLE interaction effect, γ51       0.001         0.000   2.816         331          .005
For fall VocComp slope, β6
  Fall VocComp effect, γ60          0.901         0.027   32.995        331          <.001

Random effect          SD        Variance component   df   χ²         p
Classroom level, u0    0.14574   0.02124              25   50.16340   .002
Student level, r       0.41352   0.17100

Deviance = 464.152946 

Note. HLM = hierarchical linear modeling; TCM = teacher/child-managed; SG = small group; MF = meaning-focused; CM = child-managed; WC = 
whole class; CF = code-focused instruction. Model: 
$\text{Spring CompVoc}_{ij} = \gamma_{00} + \gamma_{01}\,\text{CLE}_{j} + \gamma_{10}\,\text{TCMSGMF}_{ij} + \gamma_{11}\,\text{CLE}_{j} \times \text{TCMSGMF}_{ij} + \gamma_{20}\,\text{CMSGMF}_{ij} + \gamma_{21}\,\text{CLE}_{j} \times \text{CMSGMF}_{ij} + \gamma_{30}\,\text{CMWCCF}_{ij} + \gamma_{31}\,\text{CLE}_{j} \times \text{CMWCCF}_{ij} + \gamma_{40}\,\text{CMWCMF}_{ij} + \gamma_{41}\,\text{CLE}_{j} \times \text{CMWCMF}_{ij} + \gamma_{50}\,\text{TCMWCMF}_{ij} + \gamma_{51}\,\text{CLE}_{j} \times \text{TCMWCMF}_{ij} + \gamma_{60}\,\text{Fall CompVoc}_{ij} + \gamma_{61}\,\text{CLE}_{j} \times \text{Fall CompVoc}_{ij} + u_{0j} + r_{ij}$. 


child-managed meaning-focused instruction (e.g., less than 1 min, 
see Figure 2) were no more effective than those judged to provide 
a low-quality CLE. As students spent more time in teacher/child- 
managed meaning-focused small group and whole-class instruc- 
tion, differences in low- versus high-quality CLE effects became 
larger. Moreover, because students who shared a classroom expe- 
rienced different amounts of small group teacher/child-managed 
meaning-focused instruction, using both systems helped to explain 
within-classroom differences in students’ outcomes, which 
classroom-level systems obscure. 

Students showed the greatest gains in vocabulary and comprehen- 
sion when their teachers provided a high-quality CLE and they spent 
greater amounts of time (e.g., 25-35 min, see Figure 2) in teacher/ 
child-managed meaning-focused instruction, particularly when this 
instruction was provided in small groups. Teacher/child-managed 
small group meaning-focused instruction was more than twice as 
effective as whole-class instruction. When we tested students’ Fall 
Score × Instruction interactions, none significantly predicted spring 
outcomes. This indicates that results were similar for students regard- 
less of fall vocabulary and comprehension scores. 

Another consideration is that for this sample of students, teachers, 
and classrooms, Student Characteristic × Instruction Time/ 
Type/Content effects on students’ literacy gains were documented 
(Connor, Morrison, Fishman, et al., 2011). Specifically, using the 
ISI/Pathways system, the smaller the difference between the ob- 
served amount of a particular type/content of instruction and the 
recommended amount for a particular student based on his or her 


assessed vocabulary and literacy skills (i.e., distance from recom- 
mendation), the greater were his or her reading comprehension 
skill gains. 

These findings support a complex systems model of how individual 
children learn in the context of classroom literacy instruction and, in 
combination with other studies (Connor, Morrison, Fishman, et al., 
2011), extend our understanding of classroom instruction and learning 
environments as dynamic systems with interacting effects (Yo- 
shikawa & Hsueh, 2001). This indicates that we will be more likely to 
identify key aspects of complex teaching and CLEs by using multiple 
frameworks and considering potential interactions among sources of 
influence at both the more micro student level as well as the more 
global classroom level. Thus, a student in a CLE that is generally 
judged to be of high quality may not show achievement gains because 
the student is not receiving appropriate amounts of the particular types 
and content of instruction that would support his or her achievement. 
Indeed, students in this study who were judged to be in a high-quality 
CLE were no more likely to participate in substantial amounts of 
teacher/child-managed meaning-focused instruction than were stu- 
dents in a low-quality CLE. As one reviewer noted, “These are the 
well-organized, ‘nice’ classrooms that [some] principals and parents 
love, even though the environment while pleasant is not particularly 
effective.” 

Such interaction effects may help to explain equivocal findings 
when only one system of observation is used and individual 
student differences are not considered. For example, in the Kane et 
al. (2012) study, one reason that the assessments of CLE might not 
Figure 2. Student-level Amounts, Content, and Type of Instruction 
Received × Quality of the Classroom Learning Environment (CLE) 
interactions on Vocabulary/Comprehension (VocComp) z scores. The x-axis 
is modeled from the 5th to the 95th percentile of time spent in 
teacher/child-managed meaning-focused (TCM-MF) small group (top) and 
whole-class (bottom) instruction, as a function of CLE score modeled 
at the 25th, 50th, and 75th percentiles; other variables are centered at their mean. 


have predicted language arts outcomes was because, even in class- 
rooms with teachers judged to be of high quality, students might 
not have received adequate amounts of specific types of content 
instruction tailored to their learning needs. None of the observation 
systems used in the Kane study considered time in content and 
type of literacy instruction at the student level or Student Characteristic 
× Instruction interaction effects. At the same time, systems 
that consider a single time point in certain types/content of literacy 
instruction and that do not consider quality or Child × Instruction 
interactions may not be highly predictive either, because 
more time spent in low-quality unaligned instruction is unlikely to 
positively predict student achievement. 

On average, students spent only about 5 min per day engaged in 
oral language and vocabulary instruction, which, by any standard, 
is a minimal amount of time. This implies that there was very little 
time for explicit instruction or other types of vocabulary instruc- 
tion (e.g., discussing or clarifying words) to take place. These 
findings are in line with previous research, which has demon- 
strated that in many elementary school classrooms, there is a very 
limited focus on vocabulary instruction (Biemiller, 2001; Durkin, 
1979; Scott & Nagy, 1997). Robust vocabulary instruction (Beck 
et al., 2002) requires teachers to allot sufficient amounts of time 
that can be used for explaining, providing examples, and elabo- 


rating on vocabulary knowledge in order to promote greater un- 
derstanding. Five minutes per day of vocabulary instruction seems 
hardly adequate. 

A closer look at the types of comprehension instruction pro- 
vided (see Appendix C) revealed that teachers used over 20 dif- 
ferent kinds of teacher/child-managed whole-class comprehension 
instructional activities (see Table C.1) and fewer (about 18) kinds 
of activities in small groups (Table C.2) to build comprehension. 
The most salient types observed involving the whole class were 
questioning (3.1 min), highlighting/identifying (about 1 min), and 
schema building (about 45 s) (see Appendix B). In teacher/child- 
managed small group instruction, again, the most salient activity 
observed was questioning (1.6 min), followed by compare and 
contrast (13 s) and the use of graphic organizers (8 s). In both 
whole-class and small group contexts, amounts varied widely (see 
Table C.1). For example, questioning ranged from 0 to 22 min, and 
use of graphic organizers ranged from 0 to 16 min. 

In general, the comprehension instruction observed was aligned 
with the findings of the National Reading Panel report (NICHD, 
2000), with a focus on strategies (compare and contrast; highlight- 
ing; graphic organizers) but little support for more complex un- 
derstanding of text, which is now required by the Common Core 
State Standards. We consider questioning among the most basic 
tools the teacher might use to build reading for understanding 
(Cazden, 1988). In contrast, very little time was spent on higher 
level inferencing in either whole-class (about 15 s, summing across 
types) or small group (about 12 s) instruction. Research shows 
that inferencing is associated with successful comprehension (Cain 
et al., 2004; Cromley & Azevedo, 2007) and is a core principle of 
the Common Core State Standards (Common Core State Standards 
Initiative, 2010). 

One aspect of the study that deserves further investigation is 
whether dimensions of the CLE might be better predictors when 
coded at the level of the individual student. It is conceivable that 
Student A and Student C might be participating in the same 
amount of time in appropriate learning opportunities but that the 
teacher is interacting with Student C in ways that are more respon- 
sive than with Student A. Hence, one might hypothesize that 
Student C will demonstrate stronger achievement than will Student 
A. Measuring CLE and instruction at the level of the individual 
student is time-consuming but may be worth the effort when trying 
to understand learning and development in classrooms. Another 
consideration is that dimensions of the CLE quality measure 
captured the social and emotional climate of the classroom as well 
as the learning environment. It may be that, particularly for chil- 
dren from low-income families, this more nurturing aspect of the 
classroom environment is providing a safe haven that facilitates 
learning, albeit indirectly. There is evidence of this effect in 
preschool, kindergarten, and first-grade classrooms (Hamre & 
Pianta, 2005; Rimm-Kaufman et al., 2005) as well as for older 
children (Reyes, Brackett, Rivers, White, & Salovey, 2012). 


Limitations 


There are limitations to this study that should be considered 
when interpreting these results. First, all participating teachers 
received professional development. This study was conducted in 
the context of a randomized controlled trial in which there was a 
significant effect of the individualized reading compared with the 
vocabulary intervention (Connor, Morrison, Fishman, et al., 2011). 
Although there were no differences between conditions on the 
quality of the CLE, students who were in classrooms where teach- 
ers learned to individualize instruction were more likely to partic- 
ipate in teacher/child-managed small group meaning-focused in- 
struction that was individualized to their learning needs (Connor, 
Morrison, Fishman, et al., 2011). The randomized control trial may 
have influenced the results presented here, and it is possible that 
our results may not generalize to classrooms in which teachers do 
not receive professional development. It might also explain why 
there were no significant Student Fall VocComp Score X Instruc- 
tion interaction effects on spring outcomes. In another sample, 
such interaction effects might influence achievement. Addition- 
ally, the three observations were scheduled at the teachers’ con- 
venience, so we most likely observed higher quality instruction 
than might have been observed otherwise. More frequent 
observations would have improved reliability (Rowan & Correnti, 
2009) but were not feasible within the funding and time constraints 
of this study. 


Practical Implications 


This study provides insight into the amounts, content, and types 
of instruction in which individual students participate and quali- 
tative aspects of the CLE that appear to be more effective for 
improving students’ literacy and language outcomes. Teacher/ 
child-managed meaning-focused but not teacher/child-managed 
code-focused instruction predicted third graders’ vocabulary and 
comprehension achievement. This might not be unexpected inas- 
much as explicit instruction in the target outcome tends to be a 
better predictor than more implicit or indirect instruction (Connor, 
Morrison, & Katch, 2004; Connor, Morrison, & Petrella, 2004; 
Connor, Morrison, & Slominski, 2006), and the outcomes were 
specifically meaning-focused. Small group instruction was twice 
as effective as whole class, perhaps because the teacher was better 
able to tailor instruction to meet the learning needs of students 
when interacting with smaller numbers of students. Additionally, 
he or she was likely to be more responsive, which the extant 
literature has established as important for student learning (Mash- 
burn et al., 2008). This responsiveness was a key dimension of the 
CLE rubric. Perhaps the most important finding was that both the type, 
amount, and content of instruction individual students received 
and the quality of the CLE matter: Students should learn best when 
provided enough time in explicit instruction from a teacher who 
is interactive, responsive, organized, and focused on providing 
targeted language and literacy content in ways that facilitate 
language and vocabulary learning. 

Classroom observation systems represent an important move 
toward policy that promises to make a true difference in what is 
defined as high-quality and effective teaching, what it looks like in 
the classroom, and how these practices can be more widely dis- 
seminated so that all students, including the most vulnerable, can 
experience effective instruction and academic gains. One chal- 
lenge will be designing systems that can be used validly and 
reliably by school professionals (Crawford et al., in press; Reddy 
et al., in press) who have varying levels of expertise and knowledge 
about literacy development. By better understanding the 
affordances of teaching and the CLE that contribute to individual 
students’ language and literacy development, we can design more 


effective instructional regimens, identify effective standards of 
practice, discover better ways to measure effective teaching, and 
develop targeted professional development for teachers and edu- 
cational leaders that will ensure that all children have the oppor- 
tunity to learn. 
Appendix A 


The ISI Classroom Learning Environment Scale (Excerpts) 


Classroom Orienting, Organization and Planning    Support for Language/Vocabulary    Warmth and Responsiveness/Control/Discipline 

Rating 1 Indicators    Rating 1 Indicators    Rating 1 Indicators 

No evidence of classroom organization. Teacher does not introduce any new words, No evidence that the teacher redirects in 
Teacher frequently does not have does not provide explicit or systematic respectful ways, nor is there evidence 
materials ready or enough materials instruction in vocabulary and does not that the teacher emphasizes student 
for all children. Classroom is provide opportunities for students to change in behavior through praise. 
frequently chaotic and very little engage in oral language. Students are not There is no evidence of the teacher 
time is spent on meaningful provided opportunities to practice key communicating what students did 
instruction. No observable system is vocabulary. Teacher does not monitor correctly or how they can improve. 
in place to facilitate students’ students’ vocabulary and comprehension. There is no evidence of students 
transition from one station or treating each other with respect. 
location to another. Whenever discipline is imposed, it is 

ineffective. 

Rating 3 Indicators Rating 3 Indicators Rating 3 Indicators 

Transitions are of reasonable length but Teacher introduces too many new Tier | Teacher inconsistently redirects in 
not consistently efficient (not all words per story/text and not enough Tier respectful ways and inconsistently 
children). There is an observable, but 2 words. Provides explicit or systematic emphasizes student change in 
not always efficient or working vocabulary instruction (not both). behavior through praise. Teacher talk 
system (e.g., center chart, daily Occasionally extends meanings. is inconsistently encouraging and 
schedule) in place for organizing Occasionally provides opportunities for respectful and inconsistently connects 
students into groups. The teacher students to engage in oral language and students’ personal experiences to 
may use a daily lesson plan (e.g., practice key vocabulary. Teacher lesson content. Inconsistently 
group activity planner print-out). monitors students’ vocabulary and communicates clearly what students 

comprehension, but rarely provides did correctly or how they can 
feedback. improve. Students inconsistently treat 


each other with respect. 





Rating 6 Indicators Rating 6 Indicators Rating 6 Indicators 
The classroom is well organized and Word knowledge is an ongoing part of the Teacher is the authority figure in the 
instruction is well organized. instructional day. Teacher’s selection of class but is never punitive. Classroom 
Classroom routine is evident. words demonstrates knowledge of the consistently offers a positive learning 
Transitions are efficient. words’ utility and relation to previously environment with clear expectations 
known words and relevance for text for students’ behavior as a member of 
being taught. Students are encouraged to the learning community. Effectively 
make connections between selects and incorporates students’ 
words/meaning they are already familiar responses, ideas, examples, and 
with and new words/meanings. Much of experiences into the lesson. 
the instruction is provided in small 
groups. 





Note. The full scale is available upon request. [SI = Individualizing Student Instruction. 


Appendix B


Listening and Reading Comprehension Excerpts From the ISI/Pathways Coding Manual 
(full manual available upon request) 


Comprehension should be coded for activities intended to increase 
students’ comprehension of written or oral text. This includes instruc- 
tion and practice in using comprehension strategies and demonstration 
of comprehension abilities. Comprehension activities generally follow 
or are incorporated into reading or listening of connected text (e.g., 
silent sustained reading followed by a comprehension worksheet, 
comprehension strategy instruction using a particular example of 
connected text, an interactive teacher read aloud during which the 
teacher models various comprehension strategies). 

7.1.10.3 Schema and Concept Building (Modifier). Listening 
and Reading Comprehension>Schema Building should be coded 
for activities which involve the teacher clarifying a concept and 
building background knowledge. For example, the teacher tells the 
students about the Middle Ages while reading a fairy tale. Discus-
sions about specific words should be coded as Print
Vocabulary>Class Discussion.

7.1.10.4 Predicting (Modifier). Predicting should be coded for 
activities which involve predicting future events or information not 
yet presented based on information already conveyed by the text (e.g., 
making predictions from foreshadowing). Predicting occurs while 
reading a story and involves specific details or events, as opposed to 
Comprehension>Previewing, which involves a general prediction of 
what the text will be about. 

7.1.10.6 Inferencing — Within-Texts (Modifier). Inferencing- 
Within-Texts should be coded for activities that involve making 
inferences within a text based on information that has not been 
explicitly stated in the text, but is inferred from information 
already conveyed in the text. An example of this would be if the 
students were reading a story about a boy who lost his dog and the 
teacher asks the students, “How do you think the boy felt when he 
finally found his dog at the end of the story?” 

7.1.10.7 Inferencing — Background Knowledge (Modifier). 
Inferencing — Background Knowledge should be coded for activ-
ities that involve making inferences within a text based on infor-
mation that has not been explicitly stated in the text but is based on
activating students’ background knowledge to make connections 
between their own knowledge/experiences and information pre- 
sented in the text to make inferences about the story. An example 
of this would be if the teacher is reading a story about a boy who 
loses his dog and the teacher says, “Have any of you ever lost a 
pet? How did it make you feel? How do you think the boy in the
story feels?” ** The difference between Inferencing-Background
Knowledge vs. Prior Knowledge is that the teacher must explicitly 
ask the students to make an inference by activating background 
knowledge.

7.1.10.8 Questioning (Modifier). Listening and Reading
Comprehension>Questioning should be coded for activities 
which involve generating or answering questions regarding 
factual or contextual knowledge from the text (e.g., What did 
Ira miss when he went to the sleepover? What was the name of
___?), provided that these activities are not better coded by
Comprehension>Prior Knowledge (e.g., when the teacher uses 
a question to scaffold children in activating personal knowledge 
related to the text: “When you go to an amusement park, what 
do you expect to see?”), Comprehension>Monitoring (e.g., 
when the teacher uses a question aimed at stimulating students’ 
metacognitive assessment of whether they comprehended the 
text: “Did I understand what happened there?”), or
Comprehension>Predicting (e.g., when the teacher asks stu- 
dents to predict what will happen next: “What do you think the 
lost boy will do now?”). Questioning should also be coded for 
AR tests, which are typically completed on the computer; AR
tests should also be coded as event code>Assessment. This
code should also be used as a default code for activities where
it is not clear whether the activity is highlighting, questioning, or
summarizing. 
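
The precedence described for the Questioning modifier can be sketched as a simple lookup. This is an illustrative sketch only, not part of the coding manual: the purpose labels used as keys are hypothetical, and only the code names on the right-hand side are taken from the excerpt above.

```python
# Hypothetical sketch of the Questioning precedence rule.
# The purpose labels (dictionary keys) are invented for illustration;
# the code names (values) are the ones named in the manual excerpt.
MORE_SPECIFIC_CODES = {
    "activate_prior_knowledge": "Comprehension>Prior Knowledge",
    "monitor_understanding": "Comprehension>Monitoring",
    "predict_next_event": "Comprehension>Predicting",
}

QUESTIONING = "Listening and Reading Comprehension>Questioning"


def code_question_activity(purpose: str) -> str:
    """Return the more specific code when one applies, else Questioning.

    Questioning also serves as the default when the purpose is unclear.
    """
    return MORE_SPECIFIC_CODES.get(purpose, QUESTIONING)
```

Under these assumptions, a factual recall question and an activity whose purpose is unclear both fall through to Questioning, whereas a question asking what will happen next is coded as Predicting.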
Appendix C


Table C1
Descriptive Statistics for Teacher/Child-Managed Comprehension Whole-Class Instruction
in Seconds

Type of comprehension instruction      Min    Max          M            SD
Previewing                             .00    374.73       24.42        72.80
Schema building                        .00    588.29       46.60        112.47
Questioning                            .00    [illegible]  186.0        294.66
Monitoring                             .00    235.49       [illegible]  29.59
Highlighting/identifying               .00    985.10       55.91        126.84
Context cues                           .00    327.01       8.04         37.45
Graphic/semantic organizers            .00    29.37        12.54        86.33
Prior knowledge                        .00    369.65       19.24        [illegible]
Retell                                 .00    192.24       [illegible]  29.95
Sequencing                             .00    168.71       1.03         12.81
Compare/contrast                       .00    380.58       22           73.03
Comprehension other                    .00    847.11       [illegible]  70.53
Multicomponent integrated              .00    808.00       [illegible]  100.80
Comprehension strategies
Predicting                             .00    109.56       [illegible]  24.28
Inferencing between texts              .00    [illegible]  [illegible]  4.46
Inferencing background knowledge       .00    180.55       [illegible]  23.98
Inferencing within text                .00    320.21       8.92         34.85
Summarizing main idea                  .00    129.28       3.16         [illegible]
Fact vs. opinion                       .00    159.19       2.06         18.05
Cause and effect                       .00    327.01       3.98         31.47
Narrative text                         .00    [illegible]  8.71         62.59
Expository text                        .00    118.34       [illegible]  11.64







Table C2
Descriptive Statistics for Teacher/Child-Managed Comprehension Small Group Instruction
in Seconds

Type of comprehension instruction
Previewing
Schema building
Questioning
Monitoring
Highlighting/identifying
Context cues
Graphic/semantic organizers
Prior knowledge
Retell
Sequencing
Compare/contrast
Multicomponent integrated
Comprehension strategies
Predicting
Inferencing background knowledge
Summarizing/main idea
Fact vs. opinion
Cause and effect
Narrative text
Workbook worksheet
Narrative Text Language Arts
Expository Text Language Arts
Expository Text Science
Workbook Worksheet Language
Workbook Worksheet Social Studies
Blackboard Language Arts
Expository Text Social Studies
Workbook Worksheet Science
Journal
Summarizing
Inferencing within text

2.10  [illegible]  95.26  0.36  [illegible]  2.49  2.09  0.10  13.27  6.30
9.03  2.74  2.47  1.08  0.04  48.41  5.82  [illegible]  49.18  [illegible]
0.17  0.94  10.32  0.44  3.28  9.48
Processing, and Working Memory in Predicting Chinese Written

The goal of the present study was to test opposing views about 4 issues concerning predictors of
individual differences in Chinese written composition: (a) whether morphological awareness, syntactic
processing, and working memory represent distinct and measurable constructs in Chinese or are just
manifestations of general language ability; (b) whether they are important predictors of Chinese written 
composition and, if so, the relative magnitudes and independence of their predictive relations; (c) whether 
observed predictive relations are mediated by text comprehension; and (d) whether these relations vary 
or are developmentally invariant across 3 years of writing development. Based on analyses of the 
performance of students in Grades 4 (n = 246), 5 (n = 242), and 6 (n = 261), the results supported 
morphological awareness, syntactic processing, and working memory as distinct yet correlated abilities 
that made independent contributions to predicting Chinese written composition, with working memory 
as the strongest predictor. However, predictive relations were mediated by text comprehension. The final 
model accounted for approximately 75% of the variance in Chinese written composition. The results were 
largely developmentally invariant across the 3 grades from which participants were drawn. 


Keywords: Chinese children’s written composition, text comprehension, mediation, working memory, 

Although humans have been engaged in writing from the time
they first began to read, considerably more research has been
devoted to the study of reading compared with writing (Wagner et
al., 2011). In the late 19th century, studies of reading were rela-
tively common while scientific studies of writing were just begin-
ning to appear sporadically (Bazerman, 2008). In the last couple of
decades, a great deal of writing research has been reported (for 
reviews, see Berninger & Chanquoy, 2012; Graham & Harris, 
2009; Grigorenko, Mambrino, & Preiss, 2012; MacArthur, Gra- 
ham, & Fitzgerald, 2006). However, with the exception of a 
relatively small literature that specifically addresses relations be- 
tween reading and writing, research on writing and its develop- 
ment has proceeded largely independent of research on reading 
(Fitzgerald & Shanahan, 2000). In addition, the vast majority of 
studies on writing are limited to alphabetic writing systems. Fi- 
nally, more research has been devoted to lower levels of reading 
and writing (i.e., decoding and spelling) compared to higher levels 
(i.e., comprehension and composition). 


Origins of Individual and Developmental 
Differences in Writing 


If one asks children to produce written compositions, two em- 
pirical facts are immediately obvious. First, within a grade or 
restricted age range, individual differences are pronounced. Some 
children write fluently, producing longer, complex, and relatively 
error-free passages. Others write haltingly and are only able to 
produce short passages replete with spelling and grammatical
errors. Second, developmental differences are evident in writing
samples produced by children from different grades.

The first model of writing to gain acceptance was proposed by 
Hayes and Flower (1980). According to the model, writing con-
sists of three parts: planning what one wants to say, translating
ideas into print, and reviewing what has been written. The model
did not address individual or developmental differences, but a 
revision of the model did so indirectly by incorporating cognitive 
processes, such as working memory, that supported writing 
(Hayes, 1996). Individual and developmental differences in these 
supporting cognitive processes presumably would affect writing. 

Several theories of writing were proposed subsequently that 
directly account for individual and developmental differences. 
Based on an analogy to the simple view of reading that explains 
individual and developmental differences in reading comprehen- 
sion as the interaction between listening comprehension skills and 
decoding skills (Gough & Tunmer, 1986; Hoover & Gough, 1990), 
Juel, Griffith, and Gough (1986) proposed a simple view of writing 
in which individual and developmental differences in writing are 
accounted for by an interaction between quality of ideas and 
spelling ability. More recent theories of writing have been ex- 
panded to reflect the fact that writing operates under cognitive 
constraints such as limited working memory that presumably also 
affect reading comprehension as opposed to being uniquely related 
to writing (Berninger & Winn, 2006; Torrance & Galbraith, 2006). 
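
The interaction in the two simple views is conventionally read as a product. The expressions below are a shorthand sketch for illustration; the labels follow the wording used above and are not notation taken from the cited papers.

```latex
\[
\text{Reading comprehension} = \text{Decoding} \times \text{Listening comprehension},
\qquad
\text{Writing} = \text{Quality of ideas} \times \text{Spelling ability}
\]
```

Under a product, a value near zero on either component yields a value near zero on the outcome regardless of the other component, which is what distinguishes an interaction from a purely additive combination.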


Relations Between Writing and Reading 


Although research and pedagogy have viewed reading and writing 
as separate domains (Shanahan, 2006), when studies have measured 
both reading and writing the results suggest that reading and writing 
are closely related (Abbott & Berninger, 1993; Berninger, Abbott, 
Abbott, Graham, & Richards, 2002; Fitzgerald & Shanahan, 2000; 
Graham & Harris, 2000; Jenkins, Johnson, & Hileman, 2004; Juel, 
1988; Juel et al., 1986; Shanahan, 1984; Tierney & Shanahan, 1991). 
Correlational analyses of measures of reading and writing indicate 
that approximately 50% of their variance is shared. When multiple 
indicators are available and latent variables can be used to reduce the 
influence of measurement error, up to 65% of the variance in reading 
and writing appears to be shared (Berninger, Abbott, et al., 2002; 
Shanahan, 2006). 
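
As a worked illustration of what these shared-variance figures imply, the sketch below converts a proportion of shared variance into the corresponding correlation magnitude, using the fact that shared variance between two variables equals the squared correlation. The function name is an assumption made for this example.

```python
import math


def correlation_from_shared_variance(shared: float) -> float:
    """Correlation magnitude implied by a proportion of shared variance."""
    if not 0.0 <= shared <= 1.0:
        raise ValueError("shared variance must lie in [0, 1]")
    return math.sqrt(shared)


# Figures quoted in the text: about 50% for observed measures,
# up to 65% when latent variables are used.
observed_r = correlation_from_shared_variance(0.50)  # roughly .71
latent_r = correlation_from_shared_variance(0.65)    # roughly .81
```

The difference between the two values reflects the attenuating effect of measurement error on correlations among observed measures.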

It is not surprising that reading and writing are highly related. 
Writing and reading draw on analogous mental processes and 
knowledge, including (a) declarative knowledge (e.g., lexical 
knowledge of phonemic, graphemic and morphological awareness, 
syntax and text format); (b) procedural knowledge, such as access- 
ing and using general knowledge to integrate various linguistic and 
cognitive processes; (c) domain knowledge, such as vocabulary, 
semantics and prior knowledge; and (d) metaknowledge or prag- 
matics in knowing the interactions of readers and writers and in 
monitoring one’s own knowledge in composing and reading 
(Fitzgerald & Shanahan, 2000; Foorman, Arndt, & Crawford, 
2011). Knowledge about reading might also be applied directly to 
writing or vice versa. Shanahan and Lomax (1986) compared 
models specifying reading-to-writing, writing-to-reading, and in- 
teractive relations in a study of second- and fifth-grade students. 
The reading-to-writing model was superior to the writing-to-
reading model, and the interactive model was superior to the
reading-to-writing model for second grade. A recent study that 
modeled the codevelopment of reading and writing at the word, 
sentence, and passage level using latent change score modeling 
found support for a reading-to-writing model at the word and 
passage levels and for an interactive model at the sentence level 
(Ahmed, Wagner, & Lopez, in press). 

Although reading and writing have much in common, there also are 
important differences. Reading involves recognition of words, 
whereas writing requires recall as well as spelling. Reading involves 
recognizing the grammatical structure of a sentence written by an 
author, whereas writing requires generating one’s own sentence struc- 
tures. Finally, reading requires following the arguments and organi- 
zational structure used by an author writing passages, whereas writing 
requires planning and designing argument structures and organizing 
sentences into coherent paragraphs and paragraphs into coherent 
documents (McCutchen, 2006). Given these differences, it is not 
surprising that writing is more difficult than reading. 


Relations Between English and Chinese 
Writing Systems 


There are obvious differences between the English and Chinese 
writing systems but less obvious yet equally important similarities. 
Beginning with the most obvious difference, written English is a 
morphophonemic alphabet in which an orthography consisting of 
26 letters as well as additional numbers and punctuation marks is 
used to represent all possible words. English is morphophonemic 
in that spellings represent pronunciation but with deviations that 
sometimes are attributable to meaning. In contrast, the character 
set of Chinese approaches 60,000 separate characters. Each char- 
acter represents a spoken syllable that is a morpheme and often a 
word. Many of the 60,000 characters are low frequency, represent- 
ing proper names or archaic words, and one can write or read 99% 
of modern Chinese with 2,400 characters (Schmandt-Besserat & 
Erard, 2008). But this still represents a difference of about two 
orders of magnitude compared to the number of letters that must be 
learned to write English. Grammar, syntax, and punctuation are 
often ambiguous and free-flowing in Chinese writing (Yan et al., 
2012). Spoken Chinese is much more homophonic than is spoken 
English. Consequently, a large number of characters refer to the
same syllable, and it is not possible to determine which of many 
meanings is intended without considering the surrounding context 
(Tan, Spinks, Eden, Perfetti, & Siok, 2005). 
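
The "two orders of magnitude" comparison can be checked with a short calculation based only on the counts quoted above (26 letters, 2,400 characters for 99% coverage, and a full set approaching 60,000). The variable names are illustrative.

```python
import math

LETTERS = 26
CHARACTERS_FOR_99_PERCENT = 2_400
FULL_CHARACTER_SET = 60_000

# Ratio of characters needed for 99% of modern Chinese to English letters.
ratio_common = CHARACTERS_FOR_99_PERCENT / LETTERS   # about 92
orders_common = math.log10(ratio_common)             # just under 2

# The same comparison for the full character set.
ratio_full = FULL_CHARACTER_SET / LETTERS
orders_full = math.log10(ratio_full)                 # a little over 3
```

The first ratio is close to 100, which is the sense in which the difference amounts to about two orders of magnitude.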

Turning to similarities, English and Chinese are both writing sys- 
tems that convey information about pronunciation and meaning. Al- 
though it is commonly thought that Chinese characters are largely 
pictorial representations of concepts absent of pronunciation, approx- 
imately 90% of Chinese characters include a graphic element that 
indicates pronunciation, along with another graphic element that in- 
dicates meaning (Schmandt-Besserat & Erard, 2008). Similarity be- 
tween writing in Chinese and English is suggested by a study of 
underlying dimensions of written composition in 160 Grade 4 and 180 
Grade 7 Chinese children (Guan, Ye, Wagner, & Meng, 2013). They 
tested the generalizability of a five-factor model of writing developed 
by Wagner et al. (2011) from an analysis of English writing samples 
to Chinese writing samples. They asked the children to write two 
compositions and used the Systematic Analysis of Language Tran- 
scripts (SALT) program (Miller & Chapman, 2001) to code and 
analyze the data. Guan et al. found that the five-factor model of macro
organization, complexity, productivity, spelling and punctuation,
and handwriting fluency that was derived from English writing sam- 
ples applied equally well to both fourth- and seventh-grade Chinese 
writers. The fourth- and seventh-grade writers differed in the latent 
means of the factors but not in the pattern of relations among factors. 

Given both important similarities and differences between Eng- 
lish and Chinese writing systems, it is difficult to predict in 
advance which aspects of knowledge about writing learned from 
the large number of studies of English writing will generalize to 
writing Chinese. Additional studies of Chinese writing will be 
necessary if we are to develop a theoretical framework for differ- 
entiating aspects of writing that are relatively language general and 
those that are relatively language specific. 


Individual and Developmental Differences in 
Chinese Writing 


Given their role in English composition and differences between 
the English and Chinese writing systems, three potentially impor- 
tant predictors of Chinese writing are morphological awareness, 
syntax, and working memory. 

Morphological awareness. Morphology is concerned with 
intraword and interword relations. Morphological awareness has 
been shown to play an important role in reading comprehension, 
particularly after controlling for word reading (Kirby et al., 2012; 
Kuo & Anderson, 2006). Morphological awareness is typically
measured by tasks such as (a) morpheme discrimination in sorting
out the odd item among four orally presented two-morpheme words
(Packard et al., 2006), (b) morpheme production in producing a 
two-morpheme word with meaning identical to a target morpheme 
and another word with meaning unrelated to the target (Shu, 
McBride-Chang, Wu, & Liu, 2006), (c) morpheme transfer of 
homophonic two-character morphemes (Packard et al., 2006), and 
(d) morpheme analogy in generalizing a morphological relation 
from a pair of words to another pair by analogy (Kirby et al., 2012; 
Liu & McBride-Chang, 2010). 

Because the Chinese language includes many homophones, 
morphological awareness is of particular importance to Chinese 
reading and writing (Hao, Chen, Dronjic, Shu, & Anderson, 2013; 
Kuo & Anderson, 2006; Liu & McBride-Chang, 2010; Packard et 
al., 2006; Shu et al., 2006; Zhang et al., 2012). Chinese morphol- 
ogy is predominantly that of morphological compounding. A com- 
pound can be defined as a word consisting of two or more words 
that are subjected to certain phonological and morphographic 
processes (Fabb, 1998). Chinese children have been shown to have 
a better developed sense of compounding than their American 
counterparts (Zhang et al., 2012). Semantic relatedness and types 
of morphemes in Chinese play different roles at different stages of 
reading literacy development in Chinese children (Hao et al., 
2013). More proficient Chinese language users, compared with 
less proficient ones, have been shown to generate more two- 
character compound words from left-headed or right-headed base 
forms (Leong & Ho, 2008). Examples are 乐观 (optimistic) and
乐器 (musical instrument) from the base form of 乐. Because
creating sentences in Chinese demands choosing appropriate char- 
acters and surrounding context to permit the reader to infer the 
correct morpheme, morphological awareness may be critically 
important to effective Chinese writing. 


Syntactic processing. Even though many Chinese sentences 
are basically of the subject-verb-object (SVO) type, syntax is less 
straightforward in Chinese compared to English. As an example, 
the subject in a sentence may not always be expressed. The 
following simple sentence begins with the verb “downed” in
下雨了 (“Downed rain already” or “It rained”). As yet another
example, the semimorphological marker 被 [bei] is meant to
express unhappy or unexpected events. It is correct to say, 我们被
[bei] 人打了 (“We are [were] beaten by others”), but it is anoma-
lous to use the negation, *我们不被 [bei] 人打了 (“We were not
beaten by others”).

Studies relating syntactic processing in Chinese to literacy ac-
quisition are sparse. Yeung et al. (2011) used an oral cloze task of the
kind “My favorite food is ___.” to gauge syntactic skill. But
this is more of a sentence completion task than a direct test
of syntactic processing. Chik et al. (2012) included several mea- 
sures of syntactic processing in a study of reading comprehension 
in Grades 1 and 2 Chinese children. In a hierarchical multiple 
regression equation, age, IQ and Chinese word reading accounted 
for 64% of the individual variation while composite syntactic skills 
added a significant 4% of the variation. 
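
The size of this increment can be put in context with a short calculation. The sketch below uses only the two percentages quoted above; the incremental effect size follows Cohen's f-squared for a set of predictors added to a model, and the variable names are illustrative.

```python
# Proportions of variance reported for the hierarchical regression.
r2_step1 = 0.64   # age, IQ, and Chinese word reading
delta_r2 = 0.04   # added by composite syntactic skills

# Total variance explained once syntactic skills are entered.
r2_full = r2_step1 + delta_r2

# Share of the variance left unexplained at step 1 that syntax accounts for.
share_of_residual = delta_r2 / (1 - r2_step1)

# Cohen's f-squared for the increment: delta R^2 / (1 - R^2 of full model).
f_squared = delta_r2 / (1 - r2_full)
```

On these figures, syntactic skills account for roughly one ninth of the variance that the first step left unexplained.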

Working memory. Working memory is believed to be a key 
predictor of written composition because it provides the cognitive 
workspace in which writing processes are carried out (Abbott, 
Berninger, & Fayol, 2010; Berninger & Winn, 2006; Hayes, 1996, 
2006; Hoskyn & Swanson, 2003; Kellogg, 1996, 1999, 2001, 
2004; Kellogg, Whiteford, Turner, Cahill, & Mertens, 2013; Mc- 
Cutchen, 2000, 2011; Swanson & Berninger, 1996; Torrance & 
Galbraith, 2006; Vanderberg & Swanson, 2007). For example, 
Swanson and Berninger (1996) found working memory to be 
significantly correlated with writing after partialling out word 
knowledge. In particular, children with high memory span may 
allocate more resources to generating text rather than to transcrip- 
tion processes such as handwriting and spelling. Abbott et al. 
(2010) showed consistent and significant relations from spelling to 
text composing in their 5-year longitudinal study, relations that 
were explained using the construct of working memory. Children 
with strong spelling skills required fewer memory resources to 
translate ideas into written words and compositions than did chil- 
dren with weak spelling skills; more working memory resources 
were available to strong spellers relative to weak spellers to be 
applied to higher level aspects of writing. 

Working memory is involved in transcribing and editing during 
writing as shown in a study by Hayes and Chenoweth (2006), who 
used articulatory suppression to place an additional load on work- 
ing memory in a study of college undergraduates. The results were 
that participants in the articulatory suppression condition wrote 
more slowly and made significantly more errors compared to
participants in a control condition. 


Predicting Chinese Writing at the Latent 
Construct Level 


In general, there is a paucity of research on individual and 
developmental differences in Chinese writing. Most previous
studies on predictors of Chinese writing have focused
on relatively lower level skills such as character writing quality 
(Bi, Han, & Zhang, 2009; Guan, Liu, Ye, Chan, & Perfetti, 2011; 
Guan et al., 2013; Perfetti & Guan, 2012; Tan et al., 2005). For 
example, Tan et al. (2005) examined relations between reading and
writing Chinese characters for groups of beginning and interme- 
diate readers. Partial correlations between reading and writing after 
controlling for nonverbal intelligence were .50 (p < .001) and .47 
(p < .001) for beginning and intermediate readers, respectively. 
However, the cross-sectional and correlational design of the study 
precluded determining the directionality of these relations (Bi et 
al., 2009). 
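
For readers unfamiliar with partial correlations, the following sketch shows the standard first-order formula for the correlation between two variables after a third is controlled. It illustrates the statistic only; the input values are hypothetical and are not the zero-order correlations from Tan et al. (2005), which are not reported here.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y with z partialled out of both."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical example: reading (x), writing (y), nonverbal intelligence (z).
example = partial_correlation(r_xy=0.60, r_xz=0.40, r_yz=0.40)
```

With these illustrative inputs the partial correlation is about .52, somewhat lower than the zero-order value of .60 because part of the association is carried by the control variable.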

More recently, Yan et al. (2012) reported a longitudinal study of 
writing at the passage level as opposed to the level of the individ-
ual character. In their study, the writing quality of 9-year-old
Chinese students was predicted by earlier measures of vocabulary
knowledge, Chinese word dictation, phonological awareness,
speed of processing, speeded naming, and handwriting fluency,
all of which were significantly associated with writing after
controlling for age.

A limitation of the studies of predictors of Chinese writing just 
described, as well as many studies of English writing, is that the 
constructs were represented by single observed variables as op- 
posed to latent variables with multiple indicators. When constructs 
are represented by single observed variables, the obtained corre- 
lation and regression coefficients are affected by measurement 
error and method variance. One consequence is that measurement 
error and method variance can make it appear as though the 
constructs are distinct from one another, when in fact, they all are 
measuring an identical underlying construct such as language or 
verbal aptitude. Conversely, when constructs are represented by 
latent variables with multiple indicators, the effects of measurement
error and method variance can be reduced or eliminated depending 
on the design of the study. 
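
The effect of measurement error on observed correlations can be illustrated with the classical attenuation relation, in which the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The reliabilities below are hypothetical values chosen for illustration, not estimates from any study cited here.

```python
import math


def observed_correlation(true_r: float, rel_x: float, rel_y: float) -> float:
    """Correlation expected between two error-laden measures."""
    return true_r * math.sqrt(rel_x * rel_y)


def disattenuated_correlation(observed_r: float, rel_x: float, rel_y: float) -> float:
    """Estimate of the error-free correlation (correction for attenuation)."""
    return observed_r / math.sqrt(rel_x * rel_y)


# Two measures of one and the same construct (true r = 1.0) with
# hypothetical reliabilities of .70 and .80 correlate only about .75.
same_construct = observed_correlation(1.0, 0.70, 0.80)
```

An observed correlation of about .75 could easily be read as evidence of two related but distinct constructs, which is the risk described above when single observed variables are used.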

Several studies have begun to look at predictors of Chinese 
writing at the latent variable level. For example, in a study of 
component processes in language literacy in 361 15-year-old Chi- 
nese students, Leong and Ho (2008) used stimulus cartoon pictures 
to elicit students’ essay writing. They also obtained measures of 
morphological processing, character and word correction, text 
segmentation, dictation, copying words and text, text comprehen- 
sion, oral reading and reading fluency. Exploratory factor analysis 
was used to examine underlying dimensions of task performance. 
Six components accounted for 67% of the total variance, with half 
of the variance accounted for by the component of lexical knowl- 
edge. This consisted of morphological processing, correct usage of 
lexical items, segmentation of text passages and writing to dicta- 
tion. These patterns were largely validated in a confirmatory factor 
analysis with a new group of 1,164 15-year-old Chinese students 
(Leong, Ho, Chang, & Hau, 2013). The strongest correlations
among factors were those between the reading and writing
factors.


The Present Study 


The goal of the present study was to test opposing views about 
four issues concerning predictors of individual differences in Chi-
nese written composition: (a) whether morphological awareness,
syntactic processing, and working memory represent distinct and
measurable constructs or are manifestations of general language
ability; (b) whether they are important predictors of Chinese writ-
ten composition, and if so, the relative magnitudes and indepen-
dence of their predictive relations; (c) whether observed predictive
relations are mediated by text comprehension; and (d) whether
these relations vary or are developmentally invariant across 3 years 
of writing development. 
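
The logic of the mediation question in (c) can be sketched with the usual product-of-coefficients decomposition for linear models, in which the total effect of a predictor equals its direct effect plus the indirect effect through the mediator. The path values below are hypothetical and serve only to show the arithmetic; they are not estimates from the present study.

```python
def decompose_effect(a: float, b: float, c_prime: float) -> dict:
    """Split a total effect into direct and indirect (mediated) parts.

    a        -- path from predictor to mediator (e.g., text comprehension)
    b        -- path from mediator to outcome, controlling for the predictor
    c_prime  -- direct path from predictor to outcome
    """
    indirect = a * b
    total = c_prime + indirect
    proportion = indirect / total if total != 0 else float("nan")
    return {
        "direct": c_prime,
        "indirect": indirect,
        "total": total,
        "proportion_mediated": proportion,
    }


# Hypothetical standardized paths.
effects = decompose_effect(a=0.60, b=0.50, c_prime=0.10)
```

With these illustrative values, three quarters of the total effect runs through the mediator, which corresponds to a relation that is largely but not completely mediated.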

1. Distinct and measurable constructs in Chinese or just man-
ifestations of general language ability? Based on the existing
literature of predictors of writing in English, and on the nature of 
the Chinese writing system, morphological awareness, syntax, and 
working memory are potentially important predictors of Chinese 
written composition. However, because previous studies typically 
have included only one of these constructs, and the constructs have 
been represented as single indicator observed variables, it remains 
important to determine whether these are meaningful, distinct, and
measurable constructs as opposed to simply measures of general
language ability. This issue was addressed in the present study by 
measuring each construct with multiple indicators and then using 
confirmatory factor analysis to test alternative models. One of 
these posited a three-factor model with morphological awareness, 
syntactic processing, and working memory as distinct yet poten- 
tially correlated abilities; another posited a single-factor model 
representing general language ability. 

2. Important predictors of Chinese written composition, and if 
so, what are the relative magnitudes and independence of their 
contributions to prediction? Although there is a theoretical ratio- 
nale for expecting morphological awareness, syntactic processing, 
and working memory to be important predictors of Chinese written 
composition, the empirical evidence is scant. Only two studies 
have used morphological processing as a predictor of writing in 
Chinese (Leong & Ho, 2008; Leong et al., 2013). No studies have 
used syntax or working memory as a predictor of Chinese writing. 
By including all three constructs as predictors in the present study, 
it was possible to determine (a) whether each was an important 
predictor of Chinese written composition, (b) the relative magni- 
tudes of their predictive relations, and (c) whether their contribu- 
tions to predictions were independent or redundant. Bivariate 
relations between latent variables representing each of the three 
predictors and the criterion were used to determine whether each 
was an important predictor of Chinese written composition. We 
used dominance analysis (Azen & Budescu, 2003; Budescu, 1993) 
to compare the relative magnitude of these predictive relations. We 
used structural equation models that included all three constructs 
as simultaneous predictors to determine whether their contribu- 
tions to prediction were independent or redundant. It would be 
possible for the constructs to be distinct, yet for their predictive 
relations to Chinese written expression to be redundant if their 
prediction of writing was due to their correlation with general 
language ability and language ability in turn predicted writing. 
Alternatively, each construct could be capturing different as- 
pects of language that were independently related to writing. 

3. Are any observed predictive relations mediated by text com- 
prehension? Because of similarities and differences between read- 
ing and writing, predictive relations between the three key con- 
structs of morphological awareness, syntactic processing, and 
working memory and the dependent variable of Chinese written 
composition might be mediated by text comprehension. Individual 
and developmental differences in morphological awareness, syn- 
tactic processing, and working memory have been shown to be 
important predictors of reading, although much of this research is 
limited to reading alphabetic writing systems. However, because 
of the constructive nature of writing, they may be involved in 
writing to a greater extent than they are in reading. In the present 
study, we compared alternative models that proposed that predic- 
tive relations between morphological awareness, syntactic process- 
ing, and working memory were (a) unmediated, (b) partially me- 
diated, or (c) fully mediated by text comprehension. 

A variable is a mediator “to the extent that it accounts for the 
relationship between the predictor and the criterion” (Baron & 
Kenny, 1986, p. 1176). According to Baron and Kenny (1986), 
three conditions must be met to establish M as a mediator of the 
predictive relation between X and Y: (a) X must significantly 
predict Y, (b) X must significantly predict M, and (c) M must 
significantly predict Y controlling for X. Complete mediation is 
said to occur when the direct effect of X on Y decreases to zero 
with the addition of potential mediator M. Partial mediation is said 
to occur when the direct effect of X on Y decreases nontrivially but 
not to zero with the addition of potential mediator M. No media- 
tion is said to occur when the direct effect of X on Y is substan- 
tially unchanged with the addition of potential mediator M. 
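
The three conditions above can be illustrated with a minimal sketch on simulated observed variables. The sketch assumes ordinary least squares regression with NumPy rather than the latent-variable models used in the study; the data, the helper `ols`, and the effect sizes are hypothetical, and significance testing is omitted.

```python
import numpy as np

def ols(y, *predictors):
    """Return OLS slopes of y on the predictors (an intercept is fitted but not returned)."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

# Simulated data (hypothetical): X affects Y directly and through M.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                       # predictor X
m = 0.6 * x + rng.normal(size=n)             # potential mediator M
y = 0.5 * m + 0.2 * x + rng.normal(size=n)   # criterion Y

(c,) = ols(y, x)            # condition (a): total effect of X on Y
(a,) = ols(m, x)            # condition (b): effect of X on M
c_prime, b = ols(y, x, m)   # condition (c): effect of M on Y controlling for X
indirect = a * b            # indirect (mediated) effect

print(f"total c = {c:.3f}, direct c' = {c_prime:.3f}, indirect a*b = {indirect:.3f}")
```

In a linear OLS setting the total effect decomposes exactly into the direct and indirect parts (c = c' + ab), so the drop from c to c' when M is added is what distinguishes full, partial, and no mediation.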

To test for mediation, we used the two models represented in 
Figure 1. Figure 1a depicts a structural equation model that spec- 
ifies morphological awareness, syntactic processing, and working 
memory as predictors of writing. Fitting this model to the data 
provides estimates of the unique contributions to prediction of 
writing made by the three constructs. Figure 1b depicts a structural 
equation model in which text comprehension has been added as a 
mediating latent variable. The mediating relations are represented 
by the indirect effects from each of the three predictors through 
text comprehension to writing. The magnitude of the direct effects 
from the three predictors to writing in Figure 1b after the mediator 
variable of text comprehension is added determines whether there 
is evidence for full, partial, or no mediation. Full mediation would 
be indicated if the direct effects are no longer significantly greater 
than zero. Partial mediation would be indicated if the direct effects 
are significantly less than they were in the unmediated model but 
remain significantly greater than zero. No mediation would be 
indicated if there are no significant differences between the mag- 
nitudes of the direct effects for the mediated and unmediated 
models. 

4. Developmental differences or invariance? In a previous study 
of the underlying dimensions of Chinese written composition, a 
five-factor model of individual differences in written composition 
that originally was developed from analyses of English writing 
samples produced by first- and fourth-grade students was found to 
generalize to Chinese writing samples produced by fourth- and 
seventh-grade students (Guan et al., 2013; Wagner et al., 2011). 
These studies suggest surprising consistency in the underlying 
dimensions of written composition across grade and language. 
However, neither of these studies examined predictors of written 
composition. As such, they are not informative about whether 
predictive relations between morphological awareness, syntactic 
processing, and working memory vary developmentally or are 
relatively invariant. In the present study, we analyzed the data 
separately by grade to examine invariance in our measurement 
model and in relations among constructs across the developmental 
range represented by fourth through sixth grades. 

The present study differs from a recently published article by 
Guan, Ye, Meng, and Leong (2013), which drew on a subset of 
poor readers and writers from the same large data pool. That study 
examined the transactional process of Chinese reading-writing 
difficulties and used similar cognitive and linguistic tasks. That 
study showed from hierarchical multiple regression analyses that 
verbal working memory contributed to individual variation in 
written composition by poor text comprehenders but not good 
readers. The study also provided insight into the quality of the 
students’ writing from a qualitative analysis of some sample writ- 
ten compositions. As discussed in this section, the emphasis of the 
present article was on the relative magnitude of predicting indi- 
vidual differences in Chinese written composition by the con- 
structs of morphological awareness, syntactic processing and 
working memory and on the predictive relations mediated by text 
comprehension. 


Method 


Participants 


A total of 749 students took part in the study. They were 
recruited to participate in a larger longitudinal study about assess- 
ment and intervention of writing disabilities (Grant DBA120179) 
and Chinese writing model (Grant YJ2012-019) conducted by the 
first author. These participants were drawn from Grades 4, 5, and 
6 in one primary school in Ningbo, Zhejiang Province, China. 
According to the municipal educational bureau report, the stu- 
dents’ parents’ average annual salary was about 25,000 USD; their 
demographic information and socioeconomic status are represen- 
tative of the middle class in China (NIES, 2012). Their parents 
signed the informed consent form before their children actually 
participated in this study. There were 246 Grade 4 students from 
six classes (n_boy = 142, n_girl = 104, M_age = 9.76 years, SD_age = 
0.84), 242 Grade 5 students from six classes (n_boy = 129, n_girl = 
113, M_age = 11.01 years, SD_age = 0.84), and 261 Grade 6 students 
from six classes (n_boy = 155, n_girl = 106, M_age = 12.31 years, 
SD_age = 0.70). For the total sample, the mean age was 11.05 years 
(SD = 1.32 years). To assess the possible dependence among the 
students within a class, we calculated the intraclass correlation for 
the three writing variables and found the nesting effect was min- 
imal (intraclass correlations [ICCs] < .047). Thus, we proceeded 
with the analysis by treating these students as independent. 
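
As an illustration of the nesting check, the following is a minimal sketch of a one-way random-effects intraclass correlation, ICC(1), computed from between-class and within-class mean squares. The source does not state which ICC variant was used, so this form and the function name `icc1` are assumptions, and the example scores are made up.

```python
import numpy as np

def icc1(groups):
    """One-way random-effects ICC(1); `groups` is a list of score arrays, one per class."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                                   # number of classes
    sizes = np.array([len(g) for g in groups])
    n_total = sizes.sum()
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(n * (g.mean() - grand_mean) ** 2 for n, g in zip(sizes, groups))
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Effective class size, which adjusts for unequal numbers of students per class.
    n0 = (n_total - (sizes ** 2).sum() / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)

# Made-up scores for three small classes.
print(icc1([[52, 55, 49, 60], [58, 47, 51, 53], [50, 56, 54, 48]]))
```

Values near zero, as reported above, mean that scores within a class are hardly more alike than scores from different classes, which supports treating students as independent observations.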

The group tasks were administered in the classrooms of the 
students. Three full-time and nine part-time Chinese-speaking 
research assistants were given several days’ intensive training on 
the rationale of the project, the reasons and designs of all the tasks, 
and specifics of administration before their field work in the 
school. These experienced assistants were carefully supervised by 
the first author to ensure high fidelity of data collection. 




Tasks and Procedure 


Multiple indicators were obtained to provide latent variables 
representing five constructs: 

Written composition (WC). We asked the children to write 
three kinds of compositions: narration, argumentation, and expo- 
sition. These kinds of writing are representative of important 
writing tasks (Berman & Nir-Sagiv, 2007; Britton, 1994). Narra- 
tives focus on people, their action, motivation, and events unfold- 
ing in a temporal sequence; expository compositions focus on 
issues with ideas unfolding in logical structure. Different from 
narrations and expositions, argumentation compositions require 
writers to argue and counterargue, all based on plausibility and 
factual information (Reznitskaya, Anderson, & Kuo, 2007). These 
written compositions were scored according to the three aspects of 
expressiveness, content and commentary. 

Narrative writing (WNar) was produced in response to four 
black-and-white line drawing cartoons without words and titles 
from Leong and Ho (2008). These cartoons have a universal appeal 
to all ages and can be interpreted flexibly from different perspec- 
tives. There were four basic elements in the cartoons: a boy 
reading while a girl is coming forward, a boy and girl in conver- 
sation, the two children having different opinions with an ensuing 
argument, the girl getting away and the boy falling. From these 
simple but integrated themes the students were asked to write short 
compositions from their personal experience to construct a text- 
base to describe the scenes and to express their meaning and 
emotion (see Kintsch & Kintsch, 2005). They should also provide 
appropriate discussion and a title. This task was given to groups of 
students in 20 min, and they were requested to write individually 
between 150 and 500 words. A total score of 100 was given to each 
student. Scoring was by two research assistants according to ex- 
pressive aspects (40% of total score), content including title (40% 
of total score), and commentary (20% of total score). Expressive 
aspects included total number of words, total number of new 
words, word choice of low-frequency vocabulary, and lexical den- 
sity. The content included four aspects of topic, main idea, body, 
and conclusion. Commentary included two aspects of objective 
discussion and subjective comments on the theme of writing. Any 
disagreement on scoring was reexamined by a third assistant and 
resolved accordingly. Interrater reliability of the original two scor- 
ers represented by the Pearson product-moment coefficient for the 
task was .85, and test-retest reliability was .75. 

Argumentation writing (WArg) was on the advantages and 
disadvantages of watching television for elementary school chil- 
dren. Students were asked to state the pros and cons of watching 
television, and give reasons with examples to illustrate their points. 
They were also instructed to provide appropriate discussion for 
each point. Similar to expository writing, this task was adminis- 
tered to groups of students who were given 20 min to write 
compositions of between 150 and 500 words. The scoring procedure 
was the same as the narrative writing task. Interrater reliability for 
the task as a whole was .88, and test-retest reliability was .76. 

Expository writing (WExp) was writing on the topic of “My 
Favorite Pet/Toy.” Students were asked to name one of their 
favorite pets (or toys if there were no pets) and describe their 
detailed features and other interesting characteristics. This task 
was also given as a group-administered writing task in 20 min with 
students being requested to write individually between 150 and 
500 words. The scoring procedure was the same as the other two 
writing tasks. Interrater reliability for the task as a whole was .78, 
and test-retest reliability was .86. 

Text comprehension (TC). Eight short text passages were 
adapted from Leong, Tse, Loh, and Hau (2008) and rewritten in 
simplified Chinese characters. Four of the eight text passages were 
narrative pieces, and the other four were expository essays. These 
eight short essays were carefully balanced in syntactic complexity, 
ranged in length from six sentences to 13 sentences, and the 
contents were all of interest to children between the ages of 9 and 
12 years. An example was the passage “Alfred Nobel,” which 
gives an account of the contribution of the inventor and Nobel 
Prizes and contains eight sentences (two simple and six compound 
sentences). These eight passages were followed by three written 
open-ended comprehension questions each. The questions drew on 
higher order thinking, such as hypothesizing, using schemata, 
questioning, citing evidence, and verifying ideas and integrating 
them. 

The text comprehension task was administered to the students as 
a class in 40 min plus 10 min for two short practice examples to 
explain the task. In one practice example with three sentences the 
translated text is as follows: “One cold winter day, a group of 
displaced persons arrived at the small town. They were ghastly 
pale and utterly exhausted. The people of the small town cooked 
them hot meals.” In the first of the two questions the children were 
asked to discuss verbally what they would do if they were the 
displaced persons and given free food. The whole class was asked 
to give the “best” answers and was told these would be graded 
according to the depth of meaningfulness of the answers from a 
credit of 3, to 2, then 1 or 0 for an irrelevant or implausible answer. 
An answer such as “I will say ‘thank you’ and eat the food” was 
given a score of 1. An answer such as “I will say ‘thank you’ but 
also offer to do some work in return before taking food from 
strangers” was given a score of 3. The maximum score for the 
whole task was 72 (8 passages × 3 × 3). 

The principles of scoring the written answers were on the basis 
of problem solving and transforming knowledge and not merely 
telling it (Bereiter & Scardamalia, 1987, p. 341), of explanatory 
and not just descriptive or factual answers, and of “envisionment” 
of text-worlds (Langer, 1986). The children were further told to 
work quickly and to concentrate on making meaning and not worry 
about sentence construction and spelling since they had to read the 
eight passages and to write the answers to all the 24 open-ended 
questions on the protocols in the span of 40 min. 

To ensure consistency of grading, each set of written protocols 
was marked by two research assistants according to the grading 
principles explained above. Interrater reliability for the protocols 
for the eight passages as a whole was .91. This coefficient indi- 
cates that the eight passages as a whole and the answers to the 
comprehension questions were consistent and useable. The mainly 
narrative texts (Passages A 1, 2, 3, and 4) based on genre and 
structure constituted one indicator text comprehension 1 (TC1). 
Cronbach alpha was .77. The mainly expository texts (Passages B 
5, 6, 7, and 8) formed the second indicator text comprehension 2 
(TC2). Interrater reliability for this TC2 was .78, and test-retest 
reliability was .81. 

Working memory (WM). The working memory construct 
was represented by two tasks: a verbal span working memory task 
(VSWM) involving unrelated sentences and an operation span 
working memory (OSWM) task involving numbers and very sim- 
ple Chinese words. 

A verbal span working memory (VSWM) task was adapted 
from that used by Leong et al. (2008) and Leong and Ho (2008). 
It was based on the rationale and format of Daneman et al. 
(Daneman & Carpenter, 1980; Daneman & Merikle, 1996), as 
modified by Swanson (1992). Six sets of two, three and four 
sentences, all unrelated in meaning, were read orally by the ex- 
perimenter to groups of students. They first listened to each set of 
two, three, or four sentences, plus a comprehension question, all 
spoken in Putonghua, and were then to write down on designated 
forms their short answers to the comprehension question and the 
last word in each sentence of the set. The total testing time for this 
task was 20 min, and all the answers were scored independently by 
two research assistants. One point was awarded for each correct 
answer and the maximum score was 24. Interrater reliability for the 
task was .83, and test-retest reliability was .77. 

An operation span working memory (OSWM) task was modeled 
after the operation span task of Engle, Tuholski, Laughlin, and 
Conway (1999). Groups of students heard six sets of three or four 
sentences, each of which involved very simple mental arithmetic 
calculation with either a correct (YES) or wrong (NO) answer, 
followed by a simple spoken word. Students had to wait until the 
end of each sentence set before writing down on the designated 
forms just YES/NO to the answers of the simple calculation and 
the one word at the end in the correct order. An example of a 
three-sentence set is as follows: “Is 16 - 9 = 7? (Bear) YES/NO; 
Is 12 × 2 = 24? (Bus) YES/NO; Is 20 - 6 = 12? (Book) 
YES/NO.” The total testing time for this task was 15 min, and the 
maximum score was 21. Interrater reliability for the task was .98, 
and test-retest reliability was .76. 

Morphological awareness (Morph/MP). There were two in- 
dicators for this construct: a morphological compounding (Mor- 
Com/Morph1) task from Leong and Ho (2008) and a morpholog- 
ical chain (MorCha/Morph2) task. 

A morphological compounding (MorCom) task contained two 
parts that varied in generating left-headed or right-headed two- 
character morphological compound words with eight base items 
each for a total group administration time of 12 min. Students 
could freely choose any five base forms to produce as many 
“right-headed” two-character words in the available time and any 
six base forms to produce as many “left-headed” two-character 
words in the total time of 6 min allotted. Two research assistants 
scored the freely affixed items to the base forms. Interrater reli- 
ability was .83, and test-retest reliability was .79. The Cronbach 
alpha internal consistency reliability of all the items for this 
measure was .70. 

A morphological chain (MorCha) task required the participants 
to provide as many different two-character compound words from 
the left-headed base character as possible in 5 min time. The 
constraint was that the same base form and homophonic base 
forms could not be repeated. Two research assistants scored the 
freely affixed items to the base forms. Interrater reliability was .98, 
and test-retest reliability was .82. Cronbach alpha was .74. 

Syntactic processing (SP). Syntactic processing plays an im- 
portant role in helping language users to understand the appropri- 
ate relationship between topics and comments and the interpreta- 
tion of the sentence. A topic is what the sentence is about, and the 
comment is the rest of the sentence, separable from the topic by a 
pausal marker. A topic sets a “spatial, temporal, or individual 
framework within which the main prediction holds” (Li & Thomp- 
son, 1989, p. 85). Syntactic processing is thus an interactive 
process with lexical knowledge and sentential context mutually 
influencing each other. There were two tasks: syntax construction 
and syntax integrity. 

A syntax construction (SynCon) task consisted of 10 scrambled 
sentences scrambling mostly two-character words. Students were 
asked to recombine the words in the scrambled sentences to come 
up with the correct sequence of the lexical items to make the 
sentences grammatically correct in the recombination. The admin- 
istration time was 20 min. All the answers were scored by two 
research assistants. Interrater reliability was .88, and test-retest 
reliability was .80. Cronbach alpha of the internal consistency of 
all the items for this measure was .71. 

A syntax integrity (SynInt) task required error detection and 
correction. The syntax integrity task assessed the students’ under- 
standing and correct usage of syntactic structure. The students 
were asked to read each of the 20 short grammatically anomalous 
sentences, detect the error in the syntactic pattern and to correct 
that error. There were 20 sentences, and the testing time was 25 
min. All the answers were scored by two research assistants. 
Interrater reliability was .92, and test-retest reliability was .81. 
Cronbach alpha of the internal consistency of all the items for this 
measure was .82. 
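
For readers unfamiliar with the internal consistency coefficient reported for these tasks, the following is a minimal sketch of Cronbach's alpha computed from an item-score matrix. The function name `cronbach_alpha` and the example scores are hypothetical; the study's item-level data are not available here.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                           # number of items
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Made-up scores: five respondents on four items scored 0 or 1.
scores = [[1, 1, 1, 0],
          [1, 0, 1, 1],
          [0, 0, 1, 0],
          [1, 1, 1, 1],
          [0, 0, 0, 0]]
print(round(cronbach_alpha(scores), 2))
```

Alpha rises as the items covary more strongly relative to their individual variances, and it reaches 1 when all items rank respondents identically.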


Procedures 


The tasks were administered to groups of students over three 
consecutive days. The verbal span working memory task, morpho- 
logical compounding task, text comprehension 1, and narrative 
writing task were administered on Day 1. The syntax construction 
task, text comprehension 2, and argumentation writing task were 
administered on Day 2. The syntax integrity task, the morpholog- 
ical chain task, the operation span working memory task, and 
expository writing task were administered on Day 3. Instructions 


786 GUAN, YE, WAGNER, MENG, AND LEONG 


for each task were audio-taped and played to the student groups, 
so that all the tasks were administered uniformly across groups. 


Data Analysis 


The data analysis was carried out in three stages after data 
screening as follows. 

Confirmatory factor analysis. The first stage was to assess 
the construct validity and measurement invariance of the proposed 
latent variables. At this stage, we first conducted confirmatory 
factor analysis (CFA) for each of Grades 4, 5, and 6. In each CFA 
model, one of the factor loadings for each factor was fixed to be 
one for scale dependency in model identification. In the second 
step, we assessed measurement invariance across grades. The 
purpose of testing measurement invariance was to confirm that 
either partial or full measurement invariance held 
across grades. Failing to do so would preclude meaningful com- 
parisons across grades because of concern that the latent variables 
were not comparable. 

There are several forms of invariance in the procedure. Here we 
tested metric invariance (equal factor loadings) and scalar invari- 
ance (equal intercepts) using multigroup CFAs. Metric invariance 
is required for comparing latent means, while there is debate on 
whether scalar invariance is needed (Ployhart & Oswald, 2004). A 
stepwise procedure was adopted to assess measurement invariance 
(Vandenberg & Lance, 2000): (a) A baseline model was analyzed 
without any equality constraints for corresponding factors; (b) an 
equal factor loading model was analyzed with equality constraints 
imposed on corresponding factor loadings (metric invariance). If 
all factor loadings were invariant, we continued to (c) assess 
invariance of intercepts (scalar invariance). If not all factor loadings 
were invariant, we identified which variables had equal factor 
loadings and then among these variables, which had equal inter- 
cepts. The chi-square difference test was used to assess the invari- 
ance of factor loadings and intercepts. Chi-square difference test- 
ing was conducted using the Satorra-Bentler adjusted chi-square 
(Satorra, 2000; Satorra & Bentler, 2001). With measurement in- 
variance established, latent means were compared across grades 
with latent standardized effect sizes reported (Choi, Fan, & Han- 
cock, 2009). 
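
The scaled chi-square difference used in the stepwise procedure can be sketched as follows. The sketch follows the commonly documented Satorra-Bentler (2001) computation from robust chi-square values and their scaling correction factors; the function name and the numbers in the example are hypothetical and are not taken from this study.

```python
def sb_scaled_chi2_diff(t0, df0, c0, t1, df1, c1):
    """Satorra-Bentler scaled chi-square difference test statistic.

    t0, df0, c0: robust chi-square, degrees of freedom, and scaling correction
                 factor of the nested (more constrained) model.
    t1, df1, c1: the same quantities for the comparison (less constrained) model.
    Returns the scaled difference statistic and its degrees of freedom.
    """
    # Scaling correction for the difference test.
    cd = (df0 * c0 - df1 * c1) / (df0 - df1)
    # Rescale each robust chi-square back to its uncorrected value, then correct the difference.
    trd = (t0 * c0 - t1 * c1) / cd
    return trd, df0 - df1

# Hypothetical comparison: equal-loadings model versus baseline model.
stat, df = sb_scaled_chi2_diff(t0=120.0, df0=40, c0=1.2, t1=100.0, df1=34, c1=1.1)
print(f"scaled chi-square difference = {stat:.2f} on {df} df")
```

A nonsignificant scaled difference indicates that the added equality constraints do not worsen fit, which is the evidence for invariance at that step.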

Structural equation models. The second stage of data anal- 
ysis consisted of testing alternative structural models to estimate 
the strength of predictive relations between morphological aware- 
ness, syntactic processing, and working memory as predictors of 
written composition and the potential mediating effects of text 
comprehension on these predictive relations. Chi-square difference 
testing was used to compare results across grades. 

For the CFA and structural equation modeling (SEM) analyses, 
the goodness of fit between the data and the specified models was 
estimated by employing the comparative fit index (CFI; Bentler, 
1990), the Tucker-Lewis index (TLI; Bentler & Bonett, 1980), the 
root-mean-square error of approximation (RMSEA; Browne & 
Cudeck, 1993), and the standardized root-mean-square residual 
(SRMR; Bentler, 1995). CFI and TLI guidelines of greater than 
0.95 were employed as standards of good fitting models (Hu & 
Bentler, 1999). Different criteria are available for RMSEA. Hu and 
Bentler (1995) used .06 as the cutoff for a good fit. Browne and 
Cudeck (1993) and MacCallum, Browne, and Sugawara (1996) 
presented guidelines for assessing model fit with RMSEA: values 
less than .05 indicate close fit, values ranging from .05 to .08 
indicate fair fit, values from .08 to .10 indicate mediocre fit, and 
values greater than .10 indicate poor fit. A confidence interval of 
RMSEA provides information regarding the precision of RMSEA 
point estimates and was also employed as suggested by MacCal- 
lum et al. A SRMR < .08 indicates a good fit (Hu & Bentler, 
1999). All CFA and SEM analyses were performed with Mplus 6.1 
(Muthén & Muthén, 2010). 
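
The RMSEA guidelines quoted above can be expressed as a short sketch. The point-estimate formula is one common single-group form and is an assumption here; values reported by software under robust estimation can differ slightly. The function names are hypothetical, and the handling of the exact boundary values (.05, .08, .10) is a choice made for illustration.

```python
import math

def rmsea(chi2, df, n):
    """Point estimate of RMSEA for a single-group model with sample size n."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def describe_rmsea(value):
    """Label an RMSEA value using the bands of Browne and Cudeck (1993)."""
    if value < 0.05:
        return "close"
    if value <= 0.08:
        return "fair"
    if value <= 0.10:
        return "mediocre"
    return "poor"

# Hypothetical model: chi-square of 60 on 34 df with 300 participants.
estimate = rmsea(60.0, 34, 300)
print(round(estimate, 3), describe_rmsea(estimate))
```

Because the point estimate is imprecise in moderate samples, the confidence interval around it is read alongside the bands rather than the point estimate alone.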

Dominance analysis. The third stage of data analysis con- 
sisted of dominance analysis (Azen & Budescu, 2003) to assess the 
unique contribution and relative importance of morphological 
awareness, syntactic processing, and working memory in account- 
ing for variance in written composition. For the dominance anal- 
ysis, the dependent variable written composition was calculated as 
the sum of standardized scores of the three writing tasks. The 
predictors, working memory, morphological processing, and syn- 
tactic processing were also calculated as the sum of the standard- 
ized scores of the corresponding tasks. 
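
The composite scores described above can be sketched as a sum of z scores across the tasks that represent a construct. The function name `composite` and the example scores are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import numpy as np

def composite(*tasks):
    """Sum of z scores across tasks; each task is a sequence of raw scores."""
    z_scores = []
    for task in tasks:
        task = np.asarray(task, dtype=float)
        z_scores.append((task - task.mean()) / task.std(ddof=1))
    return np.sum(z_scores, axis=0)

# Made-up raw scores for three students on three writing tasks.
narrative = [52, 61, 70]
argumentation = [47, 55, 69]
expository = [51, 63, 72]
print(composite(narrative, argumentation, expository))
```

Standardizing before summing keeps a task with a wider raw-score range from dominating the composite.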

The purpose of dominance analysis is to address the problem 
that the relative importance of correlated predictors is affected by 
the other predictors included in or excluded from the model 
(Cohen, Cohen, West, & Aiken, 2003; Courville & Thompson, 
2001). Common measures of relative importance, including stan- 
dardized regression coefficient, zero-order correlation, partial cor- 
relation, semipartial correlation, are affected by this phenomenon. 
More recently, dominance analysis, developed by Budescu (1993) 
and refined and extended by Azen and Budescu (2003), presents a 
better alternative for analysis of predictor importance and provides 
a general approach to measure relative importance in a pairwise 
fashion in the context of all models that contain some subsets of 
the other predictors (Azen & Budescu, 2003). Dominance analysis 
is able to answer the key question of predictor importance: “Is 
variable $X_i$ more or less (or equally) important than variable $X_j$ in 
predicting Y in the context of the predictors included in the 
selected model?” (Azen & Budescu, 2003, p. 145). 

Several measures of dominance were introduced that differ in 
the strictness of the dominance definition. Here we adopted the 
strictest definition of dominance, complete dominance. We illus- 
trate this with three predictors (X1, X2, X3), as in the current 
study, to predict one criterion variable. All possible model com- 
binations of predictors were examined, including three subset 
models with only one predictor, three models with two predictors, 
and one model with all three predictors, resulting in a total of seven 
subset models. Predictor X1 is said to have complete dominance 
over predictor X2 when unique variance contribution of Predictor 
X1 is greater than Predictor X2 in each of the subset models to 
which both X1 and X2 could make additional contribution, i.e., the 
null model without any predictor, and the model with X3. 

To generalize dominance results beyond the studied sample, we 
followed Azen and Budescu (2003) in calculating the standard 
error of dominance across repeated sampling and the reproducibility 
of the present dominance in the population. Let $D_{ij}$ denote a 
measure of dominance, which equals 1 if $X_i$ dominates $X_j$, equals 
0 if $X_j$ dominates $X_i$, and equals .5 if dominance cannot be 
established between the two predictors. A distribution of $D_{ij}$ could 
be simulated by obtaining this measure over many (e.g., 1,000) 
repeated samples with replacement, which are generated using the 
bootstrap procedure. The average of these dominance values over 
all bootstrap samples, $\bar{D}_{ij}$, represents the expected level of 
dominance of $X_i$ over $X_j$ in the population. The standard error of 
$D_{ij}$, $SE(D_{ij})$, is the standard deviation of $D_{ij}$ over all 
bootstrap samples. $\bar{D}_{ij}$ closer to 0 or 1 indicates a strong case 
for a clear directional dominance, while that close to .5 suggests 
indeterminacy of dominance. The percentage of the bootstrap samples 
that replicates a dominance of, for example, $X_i$ over $X_j$, in the 
studied sample is termed reproducibility, which states the probability 
that $X_i$ dominates $X_j$ and determines a confidence level on that 
probability. 
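
The complete-dominance rule and its bootstrap summary can be sketched as follows. This is an illustrative implementation on simulated data, assuming ordinary least squares R² as the measure of variance accounted for; the function names, the simulated effect sizes, and the use of 200 rather than 1,000 resamples are choices made for the example, not details taken from the study.

```python
import itertools
import numpy as np

def r_squared(X, y, cols):
    """R^2 of an OLS regression of y on the columns of X listed in `cols`."""
    if not cols:
        return 0.0                                    # null model explains nothing
    design = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def complete_dominance(X, y, i, j):
    """Return 1.0 if X_i completely dominates X_j, 0.0 if the reverse, else 0.5."""
    others = [k for k in range(X.shape[1]) if k not in (i, j)]
    diffs = []
    for size in range(len(others) + 1):
        for subset in itertools.combinations(others, size):
            base = r_squared(X, y, subset)
            gain_i = r_squared(X, y, subset + (i,)) - base   # added R^2 from X_i
            gain_j = r_squared(X, y, subset + (j,)) - base   # added R^2 from X_j
            diffs.append(gain_i - gain_j)
    if all(d > 0 for d in diffs):
        return 1.0
    if all(d < 0 for d in diffs):
        return 0.0
    return 0.5

def bootstrap_dominance(X, y, i, j, n_boot=1000, seed=0):
    """Bootstrap mean, standard error, and reproducibility of D_ij."""
    rng = np.random.default_rng(seed)
    n = len(y)
    d_sample = complete_dominance(X, y, i, j)
    d = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)              # resample cases with replacement
        d[b] = complete_dominance(X[idx], y[idx], i, j)
    return {"D_sample": d_sample, "D_mean": d.mean(), "SE": d.std(ddof=1),
            "reproducibility": float(np.mean(d == d_sample))}

# Simulated data (hypothetical): three predictors with unequal effects.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(size=500)
print(bootstrap_dominance(X, y, 0, 2, n_boot=200))
```

With three predictors, each pairwise comparison involves only two subset models (the null model and the model containing the third predictor), which matches the definition given for X1 and X2 above.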


Results 


Preliminary Analyses 


Table 1 displays the means, standard deviations, skewness, and 
kurtosis of the various measures used in the study by grade level. 
In addition, the Shapiro-Wilk test was used to examine the 
normality of these measures. Several variables were not nor- 
mally distributed, and there were moderate ceiling effects for 
operational span working memory. The assumption of multi- 





Table 1 
Descriptive Statistics of All Tasks by Grade 
Task M SD 

Grade 4 (n = 246) 
Verbal Span Working Memory 15.09 6.94 
Operational Span Working Memory 16.30 als 
Morphological Compounding 10.05 3.60 
Morphological Chain 10.09 4.92 
Syntax Construction 11.38 3.63 
Syntax Integrity 11.50 Baill 
Text Comprehension Task 1 14.29 6.84 
Text Comprehension Task 2 14.43 5.61 
Narrative Writing 52.09 9.42 
Argumentation Writing 47.49 L27 
Expository Writing alEo2 9.63 

Grade 5 (n = 242) 
Verbal Span Working Memory L755 5.68 
Operational Span Working Memory 17.04 4.27 
Morphological Compounding 12335 4.30 
Morphological Chain 12.82 5.39 
Syntax Construction 13.74 3.62 
Syntax Integrity 12.96 3.42 
Text Comprehension Task | 19.00 7.69 
Text Comprehension Task 2 19.24 (eae 
Narrative Writing 60.65 0 
Argumentation Writing 56.92 13.54 
Expository Writing 62.69 10.60 

Grade 6 (n = 261) 
Verbal Span Working Memory oro 5.38 
Operational Span Working Memory 18.83 3.74 
Morphological Compounding 14.09 4.16 
Morphological Chain 14.25 6.61 
Syntax Construction 15.44 2.48 
Syntax Integrity 14.44 SalZ 
Text Comprehension Task 1 Pes C2, 
Text Comprehension Task 2 21.83 6.96 
Narrative Writing T5a3 13599 
Argumentation Writing 67.26 14.40 

70.78 12.92 


Expository Writing 


variate normality was violated, Mardia’s skewness = 75.04, 
p <.001, and Mardia’s kurtosis = 156.65, p < .001. To address 
the nonnormality, Satorra-Bentler correction was implemented 
for both model fit and parameter estimation by using maximum 
likelihood with robust standard errors (MLR) estimation. Table 
2 presents the intercorrelations of these measures by grade. 


Confirmatory Factor Analyses 


The results of confirmatory factor analyses carried out by 
grade using the 11 tasks as indicators of the latent variables 
indicated that the five latent factors were measured well with 
these tasks for each grade. Specifically, the fit of the confirmatory
factor model was satisfactory for Grade 4, χ²(34) = 90.57,
p < .001, RMSEA = .08 with 90% confidence interval (.06,
.10), CFI = .96, TLI = .93, SRMR = .04; satisfactory for
Grade 5, χ²(34) = 79.54, p < .001, RMSEA = .08 with 90%
confidence interval (.06, .10), CFI = .96, TLI = .94, SRMR =
.03; and satisfactory for Grade 6, χ²(34) = 94.62, p < .001,
RMSEA = .08 with 90% confidence interval (.06, .10), CFI =
.96, TLI = .93, SRMR = .04. Table 3 presents the standardized
factor loadings and the correlations among factors. The corre- 
lations between the three predictors of morphological aware- 


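Table 1 (continued)

Task                                 Skewness     Kurtosis     Shapiro-Wilk  p

Grade 4
  Verbal Span Working Memory         [illegible]  [illegible]  0.92          .000
  Operational Span Working Memory    [illegible]  0.70         0.82          .000
  Morphological Compounding          0.47         [illegible]  0.97          .000
  Morphological Chain                0.43         [illegible]  0.95          .000
  Syntax Construction                -0.26        -0.76        0.97          .000
  Syntax Integrity                   0.15         -0.48        0.98          .003
  Text Comprehension Task 1          0.77         0.65         0.95          .000
  Text Comprehension Task 2          0.24         -0.54        0.98          .004
  Narrative Writing                  [illegible]  0.53         0.92          .000
  Argumentation Writing              [illegible]  -1.44        0.98          .000
  Expository Writing                 -0.80        0.60         0.90          .000

Grade 5
  Verbal Span Working Memory         [illegible]  -0.41        0.91          .000
  Operational Span Working Memory    [illegible]  3.16         0.80          .000
  Morphological Compounding          [illegible]  0.07         0.98          .008
  Morphological Chain                0.18         -0.54        0.98          .001
  Syntax Construction                [illegible]  [illegible]  0.95          .000
  Syntax Integrity                   -0.70        0.65         0.96          .000
  Text Comprehension Task 1          -0.28        [illegible]  0.94          .000
  Text Comprehension Task 2          [illegible]  -0.86        0.97          .000
  Narrative Writing                  [illegible]  [illegible]  0.96          .000
  Argumentation Writing              [illegible]  [illegible]  0.97          .000
  Expository Writing                 [illegible]  1.66         0.86          .000

Grade 6
  Verbal Span Working Memory         [illegible]  [illegible]  0.83          .000
  Operational Span Working Memory    [illegible]  [illegible]  0.64          .000
  Morphological Compounding          0.24         0.75         0.98          .004
  Morphological Chain                0.06         [illegible]  0.95          .000
  Syntax Construction                [illegible]  3.00         0.90          .000
  Syntax Integrity                   -0.88        [illegible]  0.94          .000
  Text Comprehension Task 1          0.03         [illegible]  0.97          .000
  Text Comprehension Task 2          -0.40        [illegible]  0.97          .000
  Narrative Writing                  [illegible]  [illegible]  0.86          .000
  Argumentation Writing              -0.85        0.42         0.94          .000
  Expository Writing                 [illegible]  1.88         0.89          .000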







Table 2 
Correlations of Tasks for Grades 4, 5, and 6 


Variable 1 2 3


Grade 4 (n = 246) 


1. Verbal Span Working Memory — 
2. Operational Span Working Memory Eo) — 
3. Morphological Compounding Al 29 — 
4. Morphological Chain 22 19 44 
5. Syntax Construction 0) ei, 2) 
6. Syntax Integrity oy) 23 24 
7. Text Comprehension Task 1 56 44 ee) 
8. Text Comprehension Task 2 .60 43 61 
9. Narrative Writing 49 51 Do 
10. Argumentation Writing A8 46 Sill 
11. Expository Writing Al 48 ao) 


Grade 5 (n = 242) 


1. Verbal Span Working Memory — 

2. Operational Span Working Memory zy) — 

3. Morphological Compounding 34 = — 
4. Morphological Chain 2 30 47 
5. Syntax Construction 34 30 28 
6. Syntax Integrity 24 23 28 
7. Text Comprehension Task 1 =D) 49 50 
8. Text Comprehension Task 2 50 47 43 
9. Narrative Writing 45 40 42 
10. Argumentation Writing 40 mS) ISD, 
11. Expository Writing 47 46 44 


Grade 6 (n = 261) 


1. Verbal Span Working Memory a= 
2. Operational Span Working Memory .68 _— 
3. Morphological Compounding 45 38 oo 
4. Morphological Chain 43 moll 45 
5. Syntax Construction 28 oe 18 
6. Syntax Integrity 31 oo oS) 
7. Text Comprehension Task 1 65 4 65 
8. Text Comprehension Task 2 .66 Poi 5) 
9. Narrative Writing Bill 50 pol 
10. Argumentation Writing 58 Oe 44 
11. Expository Writing 47 42 45 


.28 = 

29 34 — 

45 50 Al — 

4] 45 42 wo — 

44 50 38 82 .64 —_ 

oo 43 38 he 62 AE) — 

34 30 a, no) 44 14 .66 — 
25 —_— 

oll 38 — 

34 51 cou — 

Sy a, 38 18 — 
239) 35 228) 18 61 — 
Pou, 42 3S) 64 48 we — 
537 46 oT 64 47 69 69 — 
12 — 

I 48 _- 
38 oy) eo — 
Bo) =)! 45 82 — 
39) 40 2D) .69 .67 = 
2g) 2S “3D .68 .66 72 — 
a2 Cail 28 38 au) 14 .62 —_— 





Note. All coefficients are statistically significant at the .05 level except the one in bold.


ness, syntactic processing, and working memory and the crite- 
rion of written composition were substantial at each grade level, 
ranging from a low of .57 to a high of .79. 

The adequate model fits and the moderate correlations between 
the three predictors of morphological awareness, syntactic process- 
ing, and working memory supported the view that these represent 
distinct and measurable abilities. The alternative view that they 
are just manifestations of a single underlying factor of general 
language ability was tested by a model that represented the 
indicators of morphological awareness, syntactic processing, 
and working memory as indicators of a single general language 
factor. This model resulted in a significantly poorer fit at each
grade, with Δχ² values of 65.32, 134.27, and 88.68 for Grades
4, 5, and 6, respectively, all significant at p < .001 for Δdf =
WI. 

We examined the measurement invariance between grades 
using multigroup CFA (see Table 4). The baseline model re- 
sulted in a good fit. The model with equal loadings resulted in 
a significantly poorer fit. We examined each variable individ- 
ually and found that narrative writing (WNAR) had a different 
loading for Grade 5. We further tested the invariance on intercepts
and found that the model with equal intercepts resulted in
a significantly poorer fit. We examined each variable individually
and found that operational span working memory
(OSWM) had a different intercept for Grade 5, and text com- 
prehension task 2 had a different intercept for Grades 5 and 6. 
These results suggest that partial measurement invariance held 
across grades. 

Table 5 presents the latent means and variances of the five 
factors for each grade. The fifth graders had significantly higher 
means than the fourth graders, and the sixth graders had significantly
higher means than the fourth and fifth graders (ps < .001). The
latent standardized effect sizes (Choi et al., 2009) for pairwise 
comparisons on the latent means (as shown in Table 5) suggested 
that the mean difference was medium between Grades 4 and 5, 
small to medium between Grades 5 and 6, and large between 
Grades 4 and 6. 
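
The effect sizes in Table 5 can be approximately reproduced by dividing each latent mean difference by the square root of the average of the two grades' latent variances. That formula is an assumption made here for illustration (the text cites Choi et al., 2009, without restating the computation), and the function name `latent_effect_size` is hypothetical. A minimal Python sketch using the written composition values printed in Table 5:

```python
import math


def latent_effect_size(mean_a, var_a, mean_b, var_b):
    """Standardized difference between two latent means (group b minus group a).

    Assumption: the standardizer is the square root of the simple average
    of the two latent variances.
    """
    pooled_sd = math.sqrt((var_a + var_b) / 2.0)
    return (mean_b - mean_a) / pooled_sd


# Written composition: (latent mean, latent variance) by grade, from Table 5.
# The Grade 6 mean is read as 19.41 from a slightly garbled table cell.
wc = {4: (0.00, 194.70), 5: (9.74, 229.76), 6: (19.41, 176.18)}

d_5_vs_4 = latent_effect_size(*wc[4], *wc[5])
d_6_vs_5 = latent_effect_size(*wc[5], *wc[6])
d_6_vs_4 = latent_effect_size(*wc[4], *wc[6])

print(round(d_5_vs_4, 2), round(d_6_vs_5, 2), round(d_6_vs_4, 2))
```

Under this assumption the three values round to 0.67, 0.68, and 1.43, which agree with the written composition row of Table 5. Weighting the variances by sample size would change the results only slightly because the three grade samples are similar in size.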

In summary, the results indicated that the latent variables were 
well measured by their indicators. Measurement invariance was 
largely supported, allowing us to compare latent means across 
grades. Sixth-grade students had significantly higher means than
the fifth-grade students, who in turn had significantly higher
means than the fourth-grade students.


Table 3
Standardized Factor Loading (Standard Error) and Interfactor Correlations From Confirmatory Factor Analysis

Task                               Factor  Grade 4            Grade 5     Grade 6

Verbal Span Working Memory         WM      .84 (.05)          .79 (.04)   .89 (.03)
Operational Span Working Memory    WM      .71 (.06)          .74 (.05)   .77 (.04)
Morphological Compounding          MP      .75 (.05)          .79 (.06)   .87 (.07)
Morphological Chain                MP      .58 (.05)          .60 (.06)   .51 (.06)
Syntax Construction                SP      .64 (.07)          .87 (.05)   .74 (.06)
Syntax Integrity                   SP      .53 (.06)          .67 (.06)   .65 (.07)
Reading Comprehension Task 1       TC      .92 (.02)          .95 (.02)   .91 (.02)
Reading Comprehension Task 2       TC      .87 (.02)          .83 (.03)   .90 (.02)
Narrative Writing                  WC      .94 (.02)          .84 (.03)   .90 (.03)
Argumentation Writing              WC      [not recoverable]  .83 (.03)   [not recoverable]
Expository Writing                 WC      [not recoverable]  .84 (.03)   [not recoverable]
Structural Equation Modeling


.82 (.02) 
79 (.05) 


Structural equation models were used to examine hypothesized 
relations among the measured constructs (presented in Figure 1). 
We fit the model simultaneously to all three grades while con- 
straining the factor loadings (except WNAR for Grade 5) and 
intercepts (except OSWM for Grade 5, and text comprehension 
task 2 for Grades 5 and 6) equal across grades as supported by the 
measurement invariance results. We first fit a model that specified 
morphological awareness, syntactic processing, and working 
memory as predictors of written composition. This model provided 
an excellent fit, χ²(df = 81) = 140.09, p < .001, CFI = .98, TLI =
.97, RMSEA = .05 (90% CI: .04-.06), and SRMR = .04. We then
added text comprehension as a mediating variable and reestimated
the model. This model provided a satisfactory fit, χ²(df = 122) =
334.21, p < .001, CFI = .95, TLI = .93, RMSEA = .08 (90% CI:
.07-.09), and SRMR = .05. The results from these two models are
presented by grade in Table 6. 

The first column presents bivariate latent variable correlations. 
Squaring these correlations gives an estimate of the shared vari- 
ance between each predictor and written composition. The second 
column presents structure coefficients that were obtained when the 
latent variables were used as simultaneous predictors of written 
composition. The coefficients represent the independent contribu- 
tions to prediction for each predictor. The third column presents 
structure coefficients for the predictors after text comprehension 
was added to the model as a potential mediator. The extent to 
which these structure coefficients were reduced compared to those 
without the mediator in the analysis indicates whether full, partial, 
or no mediation was occurring. The final two columns present
the estimates and bias-corrected bootstrap 95% confidence intervals
of the indirect effects of the predictors on written composition via
the mediator variable of text comprehension. Significant indirect
effects, indicated by confidence intervals that do not contain zero,
provide evidence of mediation.

The results of the first set of structural equation analyses indi- 
cated that morphological awareness, syntactic processing, and 
working memory were related to written composition individually 
and made independent contributions to prediction when considered 
as simultaneous predictors. When text comprehension was added 
as a potential mediator, the overall pattern of results was consistent 
with complete mediation. The predictive relations between the 
three predictors and written composition approached zero when 
text comprehension was added as a mediator. The bootstrap 95%
confidence intervals (CIs) indicate that the mediation effect was
significant for Grades 5 and 6 but only marginally significant for
Grade 4 (the 90% CI did not contain zero). The model accounted for
approximately 75% of the variance in written composition. 

Figures 2, 3, and 4 present the standardized path coefficients of 
the mediation model (as in Figure 1b) for the three grades. A 
chi-square difference test was conducted to examine whether each 
path was equal across grades (see Table 7). All paths were found 
equal except the path from working memory to text comprehen- 
sion, which was found equal between Grades 5 and 6, but not 
between Grades 4 and 5 (p = .04) nor between Grades 4 and 6 
(p = .003). 
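
The path-equality comparisons can be illustrated by converting a chi-square difference into a p value. The Python sketch below is a simplified illustration and not the authors' procedure: it treats the reported Δχ² as an ordinary chi-square variate, whereas a model estimated with MLR would normally call for a scaled difference test. The helper name `chi2_sf_even_df` is hypothetical, and it uses a closed form that holds only for even degrees of freedom, which covers the Δdf = 2 comparisons listed in Table 7.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate with even df.

    For df = 2m the survival function has the closed form
    exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Delta chi-square values as printed in Table 7, each with delta df = 2.
p_wm_to_tc = chi2_sf_even_df(7.43, 2)  # equal WM -> TC path across grades
p_sp_to_tc = chi2_sf_even_df(0.85, 2)  # equal SP -> TC path across grades

print(round(p_wm_to_tc, 3), round(p_sp_to_tc, 3))
```

With these inputs only the working memory to text comprehension constraint falls below the .05 level, which matches the pattern described in the text.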

In summary, the results of structural equation models supported 
complete mediation of the effects of morphological awareness, 
syntactic processing, and working memory on written composition 


.68 
Sy), 


35 
16 
65 





.68 
ay) 


61 

46 51 

62 81 

62 73 
text comprehension; WC = written composition. All coefficients are significant at p < .001.


58 


AT 
wis 
.66 


85 (.03) 
.76 (.04) 


.84 
73 85 


.67 
84 
78 


74 
69 
WM = working memory; MP = morphological awareness; SP = syntactic processing; TC 


Argumentation Writing 
Expository Writing 
Inter-factor Correlations 
Morphological Awareness 
Syntactic Processing 
Text Comprehension 
Written Composition 


Note. 
Table 4
Examination of Measurement Invariance Between Grades 4, 5, and 6

Model                          df           χ²           CFI          TLI          RMSEA  90% CI       SRMR  Δdf          Δχ²
Model 1                        [illegible]  [illegible]  [illegible]  [illegible]  .08    [illegible]  .04
Model 2 (compared to Model 1)  [illegible]  [illegible]  [illegible]  [illegible]  .08    .07, .09     .06   [illegible]  [illegible]
Model 3 (compared to Model 1)  [illegible]  [illegible]  [illegible]  [illegible]  .08    [illegible]  .05   11           14.03
Model 4 (compared to Model 3)  [illegible]  [illegible]  [illegible]  [illegible]  .08    .07, .09     .05   12           34.90
Model 5 (compared to Model 3)  122          299.95***    .96          .94          .07    .06, .09     .05   9            14.61

Model descriptions:
Model 1: Baseline model.
Model 2: Model with equal loadings.
Model 3: Model with equal loadings except narrative writing of Grade 5.
Model 4: Model 3 + equal intercepts.
Model 5: Model 3 + equal intercepts except operational span working memory of Grade 5 and text comprehension task 2 of Grades 5 and 6.

Note. CFI = comparative fit index; TLI = Tucker-Lewis coefficient; RMSEA = root-mean-square error of approximation; CI = confidence interval; SRMR = standardized root-mean-squared residual.
[significance-level footnote illegible]


via text processing. The results were largely consistent across 
grades. 


Dominance Analysis 


Table 8 presents the unique contribution, in terms of proportion
of variance explained, of the three predictors of written
composition. The first column contains the total R² for the
corresponding subset model, and the remaining columns report the
unique variance contribution added to that subset model. For
example, for Grade 4, the subset model with WM demonstrates
that 34% of written composition variance was accounted for by
WM. After controlling for WM, the unique contribution to variance
was 14% for MP and 8% for SP. In the subset model of
WM-MP, the two predictors jointly accounted for 48% of variance,
with 3% unique variance added by SP. Based on these
unique contributions, we calculated the average contribution of a
predictor as the mean of its average contribution over the subset
models with the same number of predictors (Budescu, 1993). For
all three grades, WM was the strongest predictor of written
composition, uniquely contributing, on average, 20.67% of the
variance for Grade 4, 17.00% of the variance for Grade 5, and 24.17%
of the variance for Grade 6. This was followed by MP (Grade 4:
18.83%; Grade 5: 12.50%; Grade 6: 12.17%) and SP (Grade 4:
11.67%; Grade 5: 11.67%; Grade 6: 9.67%).
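
The averaging step described above can be sketched in Python. The example uses the Grade 4 subset R² values from Table 8; the full-model value of .51 is an assumption obtained by adding the 3% contributed by SP to the 48% of the WM-MP model, because that cell is not legible in the table. The function name `general_dominance` is hypothetical, and the sketch is an illustration of the averaging rule rather than the authors' code.

```python
from itertools import combinations


def general_dominance(r2, predictors):
    """Average each predictor's added R^2 within subset size, then across sizes."""
    result = {}
    for p in predictors:
        others = [q for q in predictors if q != p]
        size_means = []
        for k in range(len(others) + 1):
            gains = []
            for subset in combinations(others, k):
                base = frozenset(subset)
                # Increment in R^2 when p joins the subset model.
                gains.append(r2[base | {p}] - r2[base])
            size_means.append(sum(gains) / len(gains))
        result[p] = sum(size_means) / len(size_means)
    return result


# Grade 4 subset-model R^2 values (Table 8).
r2_grade4 = {
    frozenset(): 0.0,
    frozenset({"WM"}): 0.34,
    frozenset({"MP"}): 0.32,
    frozenset({"SP"}): 0.24,
    frozenset({"WM", "MP"}): 0.48,
    frozenset({"WM", "SP"}): 0.42,
    frozenset({"MP", "SP"}): 0.40,
    frozenset({"WM", "MP", "SP"}): 0.51,  # assumed: .48 + .03
}

averages = general_dominance(r2_grade4, ["WM", "MP", "SP"])
for name, value in averages.items():
    print(name, round(100 * value, 2))
```

With these rounded inputs the averages are about 20.67% for WM and 11.67% for SP, matching the text, and about 18.67% for MP, slightly below the reported 18.83%, presumably because the table entries are rounded to two decimals. The three averages sum to the full-model R², which is a general property of this decomposition.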

In Table 9, the first and second columns identify the two
variables being compared; the third column is the value of the
dominance measure D_ij in the sample; the fourth column is the average
value (mean D_ij) over the 1,000 bootstrap samples; and the fifth column
is the standard error of the D_ij values. The next three columns
describe the distribution of D_ij over the 1,000 bootstrap samples,
where P_ij is the proportion of samples in which X_i dominates X_j, P_ji
is the proportion of samples in which X_j dominates X_i, and P_noij is
the proportion of samples in which the dominance could not be
established. The last column is the reproducibility of the sample
results, i.e., the proportion of bootstrap subsamples that agree with
the tested sample results.

Table 5
Latent Means, Variances, and Latent Standardized Effect Sizes

Examining the sample dominance values and reproducibility 
indices, working memory dominates morphological processing 
with high reproducibility for Grades 5 (82%) and 6 (98%), and 
with low reproducibility for Grade 4 (56%). Working memory 
dominates syntactic processing for all three grades with high 
reproducibility (92% for Grade 4, 83% for Grade 5, and 93% for 
Grade 6). The sample suggests that morphological processing
dominated syntactic processing for Grade 4 (with 90% reproducibility),
but the relation was undetermined for Grades 5 and 6.
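
The bootstrap summaries reported in Table 9 can be illustrated with a short Python sketch. It assumes the coding suggested by the table, in which D_ij is 1 when X_i dominates X_j, 0 when X_j dominates X_i, and 0.5 when dominance cannot be established. The function name `summarize_dominance` and the synthetic list of 100 bootstrap outcomes are hypothetical; the list is built only to mirror the proportions reported for Grade 5, morphological versus syntactic processing.

```python
from statistics import mean, stdev


def summarize_dominance(sample_d, bootstrap_d):
    """Summarize bootstrap dominance outcomes coded 1.0, 0.0, or 0.5."""
    n = len(bootstrap_d)
    return {
        "mean": mean(bootstrap_d),
        "se": stdev(bootstrap_d),  # spread of D_ij over bootstrap samples
        "P_ij": sum(d == 1.0 for d in bootstrap_d) / n,
        "P_ji": sum(d == 0.0 for d in bootstrap_d) / n,
        "P_noij": sum(d == 0.5 for d in bootstrap_d) / n,
        # Share of bootstrap samples that agree with the sample result.
        "reproducibility": sum(d == sample_d for d in bootstrap_d) / n,
    }


# Synthetic outcomes: 43 with X_i dominant, 38 with X_j dominant, 19 undetermined.
boot = [1.0] * 43 + [0.0] * 38 + [0.5] * 19
summary = summarize_dominance(0.5, boot)

for key, value in summary.items():
    print(key, round(value, 2))
```

Because the sample result in this case is undetermined (D_ij = 0.5), reproducibility equals P_noij, which is why the corresponding row of Table 9 reports a reproducibility of only .19.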

In summary, the dominance analysis showed that working
memory dominated syntactic processing for all grades. For
Grade 4, morphological awareness dominated syntactic processing.
For Grades 5 and 6, working memory dominated morphological
awareness. The other pairwise comparisons on unique contribution
did not suggest a reproducible dominance relationship.


Discussion 


The goal of the study was to test opposing views about four 
issues concerning predictors of individual differences in Chinese 
written composition. We discuss results that address each issue 
before turning to limitations of our study and issues that future
research should address.











                         Grade 4             Grade 5                   Grade 6          Standardized effect size
Variable                 M     Variance      M            Variance     M      Variance  Grade 5 - Grade 4  Grade 6 - Grade 5  Grade 6 - Grade 4
Working Memory           0.00  [illegible]   2.43         20.48        4.06   21.63     0.47               0.36               0.78
Morphological Awareness  0.00  [illegible]   [illegible]  10.10        4.09   11.95     0.78               0.53               1.30
Syntactic Processing     0.00  [illegible]   [illegible]  [illegible]  3.87   3.94      0.80               0.80               1.80
Text Comprehension       0.00  37.05         4.67         [illegible]  6.75   51.05     0.70               0.29               1.01
Written Composition      0.00  194.70        9.74         229.76       19.41  176.18    0.67               0.68               1.43





Table 6
Bivariate Correlations, Structure Coefficients, Structure Coefficients With Mediation, and Indirect Effects for Grades 4, 5, and 6

                                                  Structure    Structure coefficient  Indirect effect
Grade    Variable                 Correlation     coefficient  with mediator          Estimate     Bootstrap 95% CI
Grade 4  Working Memory           [illegible]     [illegible]  0.31                   0.57         (-0.25, 3.46)
         Morphological Awareness  [illegible]     1.96         0.57                   1.29         [illegible]
         Syntactic Processing     [illegible]     [illegible]  0.79                   1.59         [illegible]
Grade 5  Working Memory           [illegible]     [illegible]  0.24                   1.70         (0.82, 3.28)
         Morphological Awareness  [illegible]     1.48         [illegible]            0.95         (0.34, 2.08)
         Syntactic Processing     [illegible]     [illegible]  0.09                   1.28         (0.56, 2.38)
Grade 6  Working Memory           [illegible]     [illegible]  0.59                   [illegible]  (0.06, 2.75)
         Morphological Awareness  [illegible]     [illegible]  0.04                   [illegible]  (0.04, 3.55)
         Syntactic Processing     [illegible]     2.01         0.12                   [illegible]  [illegible]
Note. CI = confidence interval.
[significance markers and significance-level footnote illegible]


1. Distinct and measurable constructs in Chinese or just
manifestations of general language ability? Because previous studies
typically represented morphological awareness, syntactic processing,
and working memory as single-indicator observed variables
and did not include all three constructs, an important first step was 
to determine whether they represented distinct constructs or were 
merely manifestations of general language ability. We did this in 
the present study by including all three constructs and representing 
each as a latent variable with multiple indicators. 

Based on the adequate model fits obtained for confirmatory
factor analysis models that specified them as distinct yet
potentially correlated abilities, and the fact that the obtained factor
correlations ranged from .35 to .67, the results support
morphological awareness, syntactic processing, and working memory as
distinct yet correlated constructs, as opposed to just manifestations
of general language ability. A single-factor model specifying that
the indicators of morphological awareness, syntactic processing,
and working memory were indicators of general language
ability resulted in substantially and significantly poorer model fits.
The results thus supported the view that morphological awareness,
syntactic processing, and working memory are distinct and
measurable constructs rather than just manifestations of general
language ability. They are not independent, however, and their shared
relations with general language ability are a likely source of their
moderate intercorrelation.

Figure 2. Structural equation model of Grade 4 showing standardized effects of working memory, morphological
awareness, and syntactic processing on text comprehension (TextCom) and written composition.
VSWM = verbal span working memory; OSWM = operation span working memory; MorCom = morphological
compounding (from Leong & Ho, 2008); MorCha = morphological chain; Com = comprehension. All
factor loadings, correlation coefficients, regression coefficients, and residual variances are significant at p < .03,
except those indicated by dashed lines (ps > .37).

Figure 3. Structural equation model of Grade 5 showing standardized effects of working memory, morphological
awareness, and syntactic processing on text comprehension (TextCom) and written composition.
VSWM = verbal span working memory; OSWM = operation span working memory; MorCom = morphological
compounding (from Leong & Ho, 2008); MorCha = morphological chain; Com = comprehension. All
factor loadings, correlation coefficients, regression coefficients, and residual variances are significant at p < .01,
except those indicated by dashed lines (ps > .15).

2. Are morphological awareness, syntactic processing, and
working memory important predictors of Chinese written
composition, and if so, what are the relative magnitudes and
independence of their contributions to prediction? Based on (a) their role
in predicting English reading and, to a lesser degree, English
writing; (b) relations between reading and writing; and (c)
characteristics of the Chinese writing system that place a premium on
these three constructs, a theoretical rationale exists for expecting
morphological awareness, syntactic processing, and working


memory to be important predictors of Chinese written composi- 
tion. However, there is scant empirical evidence that tests this 
proposition. Results from the present study supported
morphological awareness, syntactic processing, and working
memory as important predictors of Chinese written composition.
Factor correlations between the predictors of morphological
awareness, syntactic processing, and working memory and the 
criterion of written composition obtained from the confirmatory 
























































factor analyses, which are equivalent to bivariate regression coef- 
ficients, ranged from .6 to .8. 

Finding morphological awareness to be an important predictor
of Chinese written composition is consistent with previous studies
that suggest it is related to learning to read in Chinese (Hao et al.,
2013; Kuo & Anderson, 2006; Liu & McBride-Chang, 2010;
Packard et al., 2006; Shu et al., 2006; Zhang et al., 2012) and
predicts Chinese writing ability (Leong & Ho, 2008; Leong et al.,
2013). Finding working memory to be an important predictor of
Chinese written composition is consistent with results of previous
research that has focused primarily on working memory in monolingual
Chinese- and English-speaking children (Chung & McBride-Chang,
2011; Kellogg, 2001, 2004). Working memory may contribute
to writing performance because of the need to hold information
in short-term working memory while retrieving information from
long-term memory (McCutchen, 2011; Vanderberg & Swanson,
2007). During this process, mental representation and focused
manipulation of information are important. The role of syntactic
processing appeared to be larger for higher relative to lower
grades. This is consistent with the observation that more skilled
writers apply their knowledge of syntax to their writing to a greater
extent than do less skilled writers (Cromer & Wiener, 1966) and
with the observation that knowledge of syntactic structures is
necessary for processing higher level genres (Beers & Nagy, 2009,
2011).

Figure 4. Structural equation model of Grade 6 showing standardized effects of working memory, morphological
awareness, and syntactic processing on text comprehension (TextCom) and written composition.
VSWM = verbal span working memory; OSWM = operation span working memory; MorCom = morphological
compounding (from Leong & Ho, 2008); MorCha = morphological chain; Com = comprehension. All
factor loadings, correlation coefficients, regression coefficients, and residual variances are significant at p <
.002, except those indicated by dashed lines (ps > .28).

Table 7
Examination of Equality of Path Coefficients Between Grades 4, 5, and 6

Model                          Description                   df           χ²           Δdf          Δχ²
Model 1                        Baseline model                [illegible]  [illegible]
Model 2 (compared to Model 1)  Model with equal WM→TC        [illegible]  [illegible]  2            7.43*
Model 3 (compared to Model 1)  Model with equal MP→TC        124          334.13***    [illegible]  [illegible]
Model 4 (compared to Model 1)  Model with equal SP→TC        [illegible]  [illegible]  2            0.85
Model 5 (compared to Model 1)  Model with equal TC→Writing   124          334.24***    2            0.60
Model 6 (compared to Model 1)  Model with equal WM→Writing   [illegible]  [illegible]  2            0.45
Model 7 (compared to Model 1)  Model with equal MP→Writing   [illegible]  [illegible]  [illegible]  [illegible]
Model 8 (compared to Model 1)  Model with equal SP→Writing   124          [illegible]  2            0.99

Note. WM = working memory; TC = text comprehension; MP = morphological awareness; SP = syntactic
processing.
[significance-level footnote illegible]

Turning to relative magnitudes of prediction, the results of
dominance analysis indicated that working memory was the strongest
predictor of Chinese writing, followed by morphological
awareness and syntactic processing, which were largely comparable,
with some trend for morphological awareness to dominate
syntactic processing as a predictor. These results are comparable
with research showing the importance of working memory as a
predictor of writing in English (Berninger, Abbott, et al., 2002;
Fitzgerald & Shanahan, 2000; Graham, 2006; Shanahan, 2006).

Table 8
Dominance Analysis Results of Variables Predicting Writing

Finally, we wanted to determine whether morphological aware- 
ness, syntactic processing, and working memory made indepen- 
dent contributions to prediction of Chinese written expression or 
whether their predictive relations were redundant, perhaps because 
they were correlated with language ability and language ability in 
turn predicted writing. The results of structural equation modeling 
supported the independence of their contribution to prediction. 
Significant structure coefficients were found for each predictor 
when they were included as simultaneous predictors of written 
composition (Table 6) without including text processing as a 
mediator. 

3. Are observed predictive relations mediated by text compre- 
hension? Given the similarities and differences between reading 
and writing discussed earlier, it was important to determine 
whether predictive relations between the three key constructs of 
morphological awareness, syntactic processing, and working 
memory and the dependent variable of Chinese written composi- 
tion might be mediated by text comprehension. We therefore 
compared alternative models that proposed that predictive rela- 
tions between morphological awareness, syntactic processing, and 
working memory were (a) unmediated, (b) partially mediated, or 
(c) fully mediated by text comprehension. 





Unique contribution of predictors to writing ("-" marks a cell that does not apply)

                              Grade 4                           Grade 5                           Grade 6
Subset model                  R²           WM   MP          SP    R²           WM   MP   SP    R²           WM           MP          SP
Models with one predictor
  WM                          .34          -    .14         .08   .28          -    .08  .09   .39          -            .04         .04
  MP                          .32          .16  -           .08   .23          .13  -    .09   .24          .19          -           .10
  SP                          .24          .18  [illegible] -     .21          .15  .11  -     [illegible]  .24          [illegible] -
Models with two predictors
  WM-MP                       .48          -    -           .03   .36          -    -    .05   .42          -            -           .04
  WM-SP                       .42          -    .09         -     [illegible]  -    .05  -     .43          -            .03         -
  MP-SP                       .40          .11  -           -     .32          .09  -    -     .34          [illegible]  -           -
Models with three predictors
  WM-MP-SP                    [illegible]  -    -           -     [illegible]  -    -    -     .46          -            -           -
Note. WM = working memory; MP = morphological awareness; SP = syntactic processing.




Table 9
The Sample Dominance and Their Means, Standard Errors, Probabilities, and Reproducibility
Over 1,000 Bootstrap Samples

Grade    i  j  D_ij  Mean D_ij  SE(D_ij)  P_ij  P_ji  P_noij  Reproducibility
Grade 4  1  2  1.0   0.62       0.46      0.56  0.32  0.11    0.56
         1  3  1.0   0.95       0.19      0.92  0.02  0.06    0.92
         2  3  1.0   0.93       0.22      0.90  0.04  0.07    0.90
Grade 5  1  2  1.0   0.84       0.35      0.82  0.14  0.05    0.82
         1  3  1.0   0.85       0.34      0.83  0.13  0.04    0.83
         2  3  0.5   0.53       0.45      0.43  0.38  0.19    0.19
Grade 6  1  2  1.0   0.99       0.09      0.98  0.00  0.02    0.98
         1  3  1.0   0.96       0.15      0.93  0.01  0.06    0.93
         2  3  0.5   0.60       0.41      0.45  0.25  0.31    0.31

Note. 1 = Working Memory; 2 = Morphological Processing; 3 = Syntactic Processing.


Our results were consistent with the view that the predictive role 
of morphological awareness, syntactic processing, and working 
memory in accounting for individual differences in written com- 
position is mediated through text comprehension. The mediation 
model accounted for approximately 75% of the variance in written 
composition. The results supported full rather than partial media- 
tion and are consistent with other studies that suggest writing 
depends on reading (Ahmed et al., in press; Fitzgerald & Shanahan,
2000; Shanahan & Lomax, 1986). However, further research
is necessary to support text comprehension as a true mediator. At
a minimum, our results indicate that morphological awareness, 
syntactic processing, and working memory do not predict written 
composition independently of text comprehension. A true mediat- 
ing role would require evidence that text comprehension actually 
facilitates written composition. Without further evidence from 
longitudinal and experimental studies, it is possible that the ob- 
served relation between text comprehension and written composi- 
tion might be subserved by a third construct such as language or 
verbal aptitude. 

It should also be noted that in the design of the study we asked 
our participants to write three different genres of composition— 
narration, argumentation and exposition—so as to provide as com- 
prehensive a picture as possible of the students’ writing perfor- 
mance. Even though our intent was not to analyze the effects of our 
predictors on each kind of writing, we were also interested in the 
relative performance of the students. The results show the general 
trend of better performance on narrative writing, then expository writing,
followed by argumentation writing, grade for grade (Table 1). This
differential performance by the Grades 4, 5, and 6 students is in 
keeping with the findings of the literature (Bereiter & Scardamalia, 
1987; Langer, 1986). There is also evidence from recent reading 
psychology literature that different competencies contribute to 
children’s comprehension of narrative, expository and argumenta- 
tion texts because of their different structure and different demands 
made on resource allocation (Best, Floyd, & McNamara, 2008; 
Reznitskaya et al., 2007). It is likely that what is known for reading
applies equally to writing (Englert, Stewart, & Hiebert, 1988).

With our specially designed text comprehension tasks, which comprised
four narrative and four expository texts and used open-ended written
comprehension tasks emphasizing inference, we also aimed at a
broader portrayal of text comprehension. Our approach in designing
the text comprehension tasks should address some of the
concerns raised about the influence of text and question types
on reading comprehension (Eason, Goldberg, Young,
Geist, & Cutting, 2012). What is not known is the mediating effect
of particular genres of text on particular genres of writing. What is 
also not known is the effect of prior knowledge and knowledge 
utilization in writing. From inspection of the writing protocols and 
observation of the students it seemed that they were more intent on 
content generation and followed the task-execution model of 
knowledge telling, rather than the knowledge transformation of 
Bereiter and Scardamalia (1987). 

The existing literature has not yet settled on a clear consensus 
about the nature and direction of relations between reading and 
writing (Aarnoutse, van Leeuwe & Verhoeven, 2005; Abbott et 
al., 2010; Babayigit & Stainthorp, 2011; Berninger, Vaughan, et 
al., 2002; Caravolas, Hulme, & Snowling, 2001; Cataldo & 
Ellis, 1988; Lerkkanen, Rasku-Puttonen, Aunola, & Nurmi, 
2004; Shanahan & Lomax, 1986; Sprenger-Charolles, Siegel, 
Bechennec, & Serniclaes, 2003). Text comprehension and writ- 
ten composition would seem to draw on similar linguistic and 
cognitive mechanisms and are likely to be mutually facilitative, 
but further studies are needed to understand their codevelop- 
ment. 

4. Developmental differences or invariance? In the present 
study, we analyzed the data separately by grade to examine the 
extent to which our results varied by grade across the develop- 
mental range represented by fourth through sixth grades. The 
results supported developmental invariance on two levels. First, 
the measurement models were largely invariant across the three 
grades, supporting the assertion that the latent variables used to 
represent the constructs of interest were equivalent across the three 
grades. This enabled examination of changes in latent variable 
means across grades. Second, relations among the latent variables 
also were largely invariant across grades. These results indicate 
that the fourth through sixth grade students differed primarily in 
latent variable means, rather than in what the latent variables 
measured or how they were related with one another. These results 
are consistent with other recent studies that showed evidence of 
developmental invariance in writing (Guan et al., 2013; Wagner et 
al., 2011), although it is important to keep in mind the relatively 
limited developmental range represented by the fourth through 
sixth grades. 
Limitations, Implications, and Future Directions 


One limitation of our study is its reliance on a cross-sectional rather 
than a longitudinal design. Although this design has the virtue of 
a relatively larger number of participants and a shorter duration 
compared to a longitudinal design, examination of developmental 
differences is confounded with potential cohort effects. A longi- 
tudinal design would be particularly helpful for a more rigorous 
test of mediational relations (Abbott, Amtmann, & Munson, 2006). 
A second limitation is the limited developmental range represented 
by including participants from grades four through six. Although 
writing performance does change over this period of time, a larger 
developmental range would be helpful in studying what changes 
and what does not with development. A third limitation is the 
limited nature of our writing tasks. The methods for scoring quality 
of writing and comprehension were not typical, and for the 
quality scores some of the reliabilities were lower than might be 
desired. Also, we did not incorporate important topics such as the 
processes involved in planning, formulating ideas, editing, and 
revising them to form coherent and cohesive written texts; writing 
for different purposes; and discourse knowledge about forms of 
writing (Graham, 2006; Graham & Harris, 2007; Graham & Perin, 
2007; Olinghouse & Graham, 2009). Our writing tasks were group 
administered, which makes it possible that the writing behavior of 
a given student was influenced by the surrounding context of other 
students. Further, we did not control for the effect due to legibility 
before we scored the students’ writing task (Graham, Harris, & 
Hebert, 2011). A meta-analysis by these authors suggests that 
legibility has a large effect on scoring quality of writing (Graham, 
Harris, & Hebert, 2011). Another limitation is that all students were from 
a single school, although this might ensure that participants came 
from similar socioeconomic backgrounds and language/literacy 
experience. We also acknowledge that our results apply to normally 
developing writers and may not apply to students with impair- 
ments in writing or other aspects of language. 

Despite its limitations, the current study makes several 
contributions, with practical implications, to the field of educational 
psychology, educational practice, and possibly educational policy. 
First, the study might be one of the first to establish the 
important measurable predictors of Chinese written composition. 
Second, by conducting dominance analysis, the study might be one 
of the first to reveal the relative magnitudes and independent 
contributions of each unique linguistic and cognitive factor to 
written composition. In writing practice, teachers will be 
informed about which elements of writing instruction to focus on 
to improve students’ writing performance. The third contribution is 
to theories of educational psychology of writing research. The 
study conducted confirmatory factor analysis to distinguish our 
three-factor model of writing (morphological awareness, syntactic 
processing and working memory) from the single factor model of 
general language ability. Fourth, we compared theoretically the 
alternative models of no mediation, partial mediation, and full 
mediation by text comprehension of the relations between these 
three key constructs and Chinese written composition. As well, we provided theoretically 
based empirical evidence to show that the predictive relations 
between the three key constructs of linguistic and cognitive factors 
and the dependent variable of Chinese written composition might 
be mediated by text comprehension. This provides a potential 
alternative view of how we could address the predictive relations 


among reading and writing variables. Our fifth contribution is to 
educational policy. This relates to ways of assessing reading and 
writing, and the predictive relations between linguistic and cogni- 
tive measures mediated by text comprehension. The results of our 
study might suggest a blueprint of reading and writing for educa- 
tional policy makers. 
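
The three mediation models compared in this study can be written in a generic single-predictor form. The block below is only a sketch in standard mediation notation: the symbols X (a linguistic or cognitive predictor), M (text comprehension), Y (written composition), and the path coefficients a, b, and c' are illustrative assumptions, not the notation or the latent-variable specification used in the analyses.

```latex
\begin{align}
  M &= a\,X + \varepsilon_M, \\
  Y &= c'\,X + b\,M + \varepsilon_Y .
\end{align}
% No mediation:      the indirect path is fixed to zero (ab = 0), c' is free.
% Full mediation:    the direct path is fixed to zero (c' = 0), ab is free.
% Partial mediation: both the direct path c' and the indirect path ab are free.
```

In this form, the indirect effect of X on Y through text comprehension is the product ab, and comparing the fit of the three constrained models indicates which pattern the data support best.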

In future studies, it will be important to consider different genres 
of written composition more specifically, as they may make dif- 
ferent demands on planning, translating and review processes and 
on cognitive resources such as working memory that underlie them 
(Kellogg, 2001; Torrance & Jeffery, 1999). Finally, there is a need 
for randomized controlled trials of instructional approaches and 
interventions directed towards improving writing skill (Cutler & 
Graham, 2008). 
We examined the impact of a teacher-led intervention, implemented during regular classroom instruction 
and homework, on fourth-grade students’ preference for self-regulated learning, finding main ideas in 
expository texts, and reading comprehension. In our quasi-experimental study with intact classrooms, (a) 
students (n = 266, 12 classrooms) who received regular classroom instruction (REG) were compared 
with (b) students (n = 268, 12 classrooms) who were taught text reduction strategies (TEXT) and (c) 
students (n = 229, 9 classrooms) who were introduced to text reduction strategies within the framework 
of a 7-step cyclical model of self-regulated learning (SRL + TEXT). Participating classrooms were 
[...] not be in different intervention conditions. Both in their posttest and follow-up test results (11 weeks after 
the intervention), SRL + TEXT students showed a stronger preference for self-regulated learning than 
students of the 2 other groups. The SRL + TEXT students also identified more main ideas over the 
course of the intervention. Positive effects on reading comprehension in a standardized test were 
restricted to students without migration background. 
Self-regulated learning represents a key skill in our rapidly 
changing society and one that needs to be taught and practiced as 
early as possible (Council of the European Union, 2002). The 
substantial number of effective interventions focusing on self- 
regulated learning for elementary-school students supports this 
fact. However, teacher-led interventions produce effect sizes 
smaller than those of researcher-led interventions (Dignath & 
Büttner, 2008). Yet teacher-led interventions are particularly 
important in that they are well suited for encouraging knowledge 
transfer, as the transfer of self-regulated learning skills from the 
context in which they were acquired to other domains and contexts 
works best when the skills are introduced and taught in multiple 
authentic learning settings (Dignath & Büttner, 2008; Hattie, 
Biggs, & Purdie, 1996). We, therefore, designed a teacher-led 
intervention for self-regulated learning for elementary school 
students that (a) is appropriate for regular classroom instruction and 
for homework (the two most important scholastic learning set- 
tings) and (b) can be applied in multiple subjects. 


Theoretical and Empirical Background 


Meta-analyses indicate that for elementary-school settings, self- 
regulation interventions based on social cognitive theory are 
among the most effective (Dignath & Büttner, 2008; Dignath, 
Buettner, & Langfeldt, 2008). In his frequently cited model based on 
social cognitive theory, Zimmerman (1989, 2000) divides 
the self-regulation process into three successive phases: a 
forethought phase, a performance or volitional-control phase, and a 
self-reflection phase. 

The forethought phase encompasses those prerequisite pro- 
cesses that precede actions and learning efforts. The performance 
or volitional-control phase includes processes that are important 
during learning and influence one’s focus and behavior. During the 
self-reflection phase, which begins after learning and concludes 
the cyclical model by Zimmerman (2000), learners evaluate the 
outcome of their learning. Processes occurring during the self- 
reflection phase influence the next forethought phase. Each phase 
within the model brings together numerous cognitive, metacogni- 
tive, and motivational aspects (for an overview, cf. Zimmerman, 
2000). 

By conceptualizing optimal self-regulated learning, models such 
as Zimmerman’s (2000) provide a basis for designing 
interventions and for conducting research on self-regulated learning. 
Research findings indicate that acquainting intervention participants 
with an intervention’s theoretical model improves both its effec- 
tiveness and transfer (Salomon & Perkins, 1989; Stahl, Simpson, 
& Hayes, 1992). However, as theoretical models such as that of 
Zimmerman (2000) include numerous aspects in each phase and 
are therefore relatively complex, a simplified version should be 
taught to intervention participants (Stoeger & Ziegler, 2008a, 
2011; Zimmerman, Bonner, & Kovach, 1996). Model simplifica- 
tion is, moreover, particularly important when interventions target 
children in elementary school (Zimmerman, 1990). 

For our intervention, we chose a simplified seven-step cyclical 
normative model of self-regulated learning (Ziegler & Stoeger, 
2005) that reflects a limited number of important aspects from the 
Zimmerman model (cf. online supplemental material, Figure S1). 
The simplified model stresses those cognitive and metacognitive 
aspects for which there are promising results from earlier 
interventions with elementary school students (Dignath & Büttner, 
2008; Stoeger & Ziegler, 2008a). This model places less emphasis 
on motivational aspects, as motivation issues appear to play a 
greater role in interventions for secondary school students 
(Dignath & Büttner, 2008). The first three steps of the intervention 
model represent aspects contained within Zimmerman’s (2000) 
forethought phase. They are self-assessment, goal setting, and 
strategic planning. The next three steps—strategy implementation, 
strategy monitoring, and strategy adjustment—represent aspects 
contained within Zimmerman’s (2000) performance or volitional- 
control phase. These three steps constitute an internal cycle within 
the larger seven-step cyclical model and can be applied to various 
cognitive strategies (e.g., organizational strategies, rehearsal strat- 
egies; cf. Weinstein & Mayer, 1986). The final step in the seven- 
step cycle of self-regulated learning, outcome evaluation, is de- 
rived from the third phase of Zimmerman’s (2000) model. As in 
Zimmerman’s (2000) model, this final step influences the way 
students work through the cycle of self-regulated learning the next 
time. 
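
The structure of the seven-step cycle described above can be summarized in a small data structure. The following is only an illustrative sketch: the names `Phase`, `STEPS`, `next_step`, and `steps_in_phase` are hypothetical and do not come from the training materials; the step names and their assignment to Zimmerman's (2000) phases follow the description in the text.

```python
from enum import Enum


class Phase(Enum):
    """The three phases of Zimmerman's (2000) model, as named in the text."""
    FORETHOUGHT = "forethought"
    PERFORMANCE = "performance or volitional control"
    SELF_REFLECTION = "self-reflection"


# The seven steps in order, each tagged with the phase it is derived from.
STEPS = [
    ("self-assessment", Phase.FORETHOUGHT),
    ("goal setting", Phase.FORETHOUGHT),
    ("strategic planning", Phase.FORETHOUGHT),
    ("strategy implementation", Phase.PERFORMANCE),
    ("strategy monitoring", Phase.PERFORMANCE),
    ("strategy adjustment", Phase.PERFORMANCE),
    ("outcome evaluation", Phase.SELF_REFLECTION),
]


def next_step(index: int) -> int:
    """Return the index of the following step; the last step wraps to the first."""
    return (index + 1) % len(STEPS)


def steps_in_phase(phase: Phase) -> list:
    """List the step names that belong to the given phase."""
    return [name for name, step_phase in STEPS if step_phase is phase]
```

The wrap-around in `next_step` mirrors the cyclical character of the model: outcome evaluation feeds into the next round of self-assessment.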

In addition to the choice of the underlying theoretical model, 
several other features of self-regulated learning interventions have 
been associated with producing particularly large effect sizes. 
Effect sizes are larger if interventions emphasize the benefit of 
strategy use and provide systematic feedback (Dignath & Büttner, 
2008; Hattie & Timperley, 2007; Schunk & Rice, 1987). Further- 
more, evidence indicates that introducing self-regulated learning 
with concrete subject matter in real-life settings is particularly 
effective and helpful for improving transfer to other tasks and 
situations (Dignath & Büttner, 2008; Hattie et al., 1996). 
Additionally, interventions are especially effective when they 
simultaneously address both in-class instruction and homework contexts 
(Ramdass & Zimmerman, 2011; Stoeger & Ziegler, 2011). Finally, 
the duration of an intervention has an influence on its overall 
efficacy and the extent to which learners succeed in transferring a 
given skill from the context in which it was taught into new subject 
areas and learning contexts (Alexander, Graham, & Harris, 1998; 
Pressley, Graham, & Harris, 2006). 


Present Research 


We sought to build upon current research findings by develop- 
ing a 7-week teacher-led training program that students apply 


during regular classroom instruction and homework. We devel- 
oped the intervention for fourth grade in accordance with Bavarian 
state curriculum guidelines that explicitly mandate the introduction 
of self-regulation skills during fourth grade (Bayerisches 
Staatsministerium für Unterricht und Kultus, 2000). Basic science¹ 
and reading instruction were chosen as content areas. Based on the 
research reported earlier, we make the assumption that introducing 
self-regulated learning in the context of two school subjects and 
during in-class instruction and homework should facilitate the 
transfer of self-regulated learning skills. 

In accordance with this content focus, we selected text reduction 
strategies for Steps 4 through 6 of the seven-step cycle of self- 
regulated learning. The training program introduces students to 
three reduction strategies that are useful for identifying and dis- 
playing main ideas: underlining and copying main ideas verbatim, 
drawing a mind map containing main ideas, and summarizing 
main ideas in one’s own words. 

We selected these strategies with four reasons in mind: First, we 
sought to design and implement an ecologically valid intervention. 
For this reason, we selected those strategies which the state cur- 
riculum recommends for regular fourth-grade German instruction 
in Bavaria (Bayerisches Staatsministerium für Unterricht und 
Kultus, 2000). 

Second, students can effectively learn to use all three of these 
strategies during regular classroom instruction, and their use of 
these strategies can lead to improvements in finding main ideas 
and reading comprehension (for an overview, cf. National Insti- 
tutes of Child Health and Human Development, 2000; Slavin, 
Lake, Chambers, Cheung, & Davis, 2009). 

Third, as we designed our intervention for regular classrooms 
with children representing a wide spectrum of ability levels, we 
were interested in selecting strategies that are appropriate both for 
average readers (e.g., Bean & Steenwyk, 1981; Griffin, Malone, 
& Kameenui, 1995) and for less advanced readers and students with 
learning disabilities (Kim, Vaughn, Wanzek, & Wei, 2004; Ma- 
lone & Mastropieri, 1992). 

Fourth, our choice of strategies also reflects findings indicating 
that teaching these strategies is particularly effective when they are 
taught in combination with one or more of the other steps covered 
in the seven-step cycle of self-regulated learning. Main-idea in- 
struction is more effective when it is combined with self- 
monitoring than when it is presented by itself (Jitendra, Hoppes, & 
Xin, 2000; Malone & Mastropieri, 1992). Similarly, interventions 
combining instruction on finding main ideas in texts or on text 
comprehension strategies with goal setting are more effective than 
the same interventions without goal setting (Schunk & Rice, 1989; 
cf., however, Johnson, Graham, & Harris, 1997). Furthermore, 
evidence documents the superiority of teaching various metacog- 
nitive strategies in combination with text strategies in comparison 
to regular instruction or to teaching only text strategies (Mason, 
2004; Souvignier & Mokhlesgerami, 2006). However, to our 
knowledge, there are no intervention studies in which students 
learn about a specific model of self-regulated learning and then— 


¹ The subject is called Heimat- und Sachunterricht in Bavaria, Germany, 
and deals with basic aspects of everyday life, including topics from 
biology, geography, physics, health, and social sciences. 
with substantive knowledge of the model—work systematically 
through the individual steps of the model. 

By combining the presentation of a normative model with 
opportunities for practicing the application of the model’s steps, 
our intervention design reflects these insights on effective self- 
regulated learning interventions. With these goals in mind, we 
designed a 7-week training program consisting of 2 informational 
weeks and 5 learning-cycle weeks. During the informational 
weeks, teachers introduce the seven-step cycle of self-regulated 
learning and the text reduction strategies mentioned previously. 
The knowledge presented during the informational weeks is then 
proceduralized in the five learning-cycle weeks. In other words, 
once students have understood the basic ideas behind the skills 
described in the seven-step cycle of self-regulated learning (during 
the two informational weeks), they then use the learning-cycle 
weeks to actually start developing these skills through practice 
with specific content (i.e., an expository text of the same length 
and difficulty level every day) and with the help of various 
learning materials. For example, participants receive learning 
journals (cf. Hübner, Nückles, & Renkl, 2010) with which they 
document their learning behavior, difficulties they encounter, and 
adjustments they make to their learning strategies (cf. the Method 
section). The learning journals and various other intervention 
materials help students to recognize the usefulness of the 
metacognitive and text strategies introduced in the intervention 
(Dignath & Büttner, 2008; Schunk & Rice, 1987). To facilitate this 
process, teachers give feedback and help the students to system- 
atically draw connections between learning behavior and learning 
achievements (cf. the description in the Method section). 

As we mentioned previously, meta-analyses indicate that 
teacher-led interventions are not as effective as researcher-led 
interventions. However, in order to facilitate the transfer of the 
skills presented in the intervention to everyday practices through- 
out a child’s school and homework activities, it is essential that 
classroom teachers lead the interventions. To increase the effec- 
tiveness of our teacher-led intervention, we placed an emphasis on 
the initial training of teachers prior to the administration of the 
program as well as on ongoing support during their implementa- 
tion of the program. Before conducting the training program in 
their classrooms, teachers completed 2 full days of training. They 
also received extensive training materials and a teachers’ manual 
designed to help them and their students avoid the sorts of barriers 
typically encountered in strategy instruction (cf. Kline, Deshler, & 
Schumaker, 1992). We also accompanied the administration of the 
entire 7-week training program (cf. Guskey, 1986). 

In the present study, we examined whether the intervention as 
described leads to effects in students’ self-reported preference for 
self-regulated learning, their ability to find main ideas in exposi- 
tory texts, and their overall reading comprehension. We compared 
three groups: students who receive regular instruction (REG), 
students who receive special instruction in text reduction strategies 
(TEXT), and students who receive instruction in text reduction 
strategies embedded in a training program focused on the seven- 
step cycle of self-regulated learning (SRL + TEXT). The com- 
parison between the SRL + TEXT condition and the REG group 
shows the effect of the entire intervention approach compared with 
regular classroom instruction. This comparison is especially rele- 
vant from a practical perspective. The comparison between the 
SRL + TEXT condition and the TEXT condition shows the 


additional benefit of teaching text reduction strategies within the 
context of a cycle of self-regulated learning. This comparison is 
especially relevant from a theoretical perspective. 

In addition to a summative evaluation with pretest, posttest, and 
follow-up data collection, we also incorporated a process evalua- 
tion (cf. Stoeger & Ziegler, 2008a; Zimmerman, 2008). For the 
summative evaluation, all three groups of students filled out a 
learning preferences questionnaire and completed a standardized 
reading comprehension test at three points in time: before the 
intervention, directly after its conclusion, and then 11 weeks later. 
In our process evaluation, we observed whether the number of 
identified main ideas increased as students in the intervention 
groups worked on the daily expository texts. 

In light of previous research, we expected an increase in the 
number of identified main ideas for both intervention groups over 
the course of the training program. We also expected that students 
in the combined intervention group (SRL + TEXT) would show a 
greater preference for self-regulated learning in comparison with 
the students in both other groups, both immediately after the 
intervention and in the follow-up test. Practicing metacognitive 
and text reduction strategies simultaneously appears to be more 
effective than only working on text strategies (cf. Dignath et al., 
2008). We thus expected that the number of identified main ideas 
would increase more for the students in the combined intervention 
group (SRL + TEXT) over the course of the 7 weeks than it would 
in the group of students who only received the text strategy 
intervention (TEXT). As the focus of our intervention was mainly 
on basic text reduction strategies (cf. Cantrell, Almasi, Carter, 
Rintamaa, & Madden, 2010) and as standardized reading compre- 
hension tests additionally measure other aspects of reading com- 
prehension not explicitly covered in our intervention, we consid- 
ered these tests to be transfer measures and expected to find weak 
to moderate effect sizes for our two intervention groups (cf. 
Souvignier & Mokhlesgerami, 2006). We expected the best per- 
formance in the reading comprehension test for students in the 
combined intervention group, followed by students in the text- 
strategy-only intervention group, whom we expected to perform 
better than students in the regular instruction group. 


Method 


Participants and Design 


Participants were 763 fourth-graders in 33 classrooms in urban, 
suburban, and rural areas in southern Germany. The students were 
on average 9.80 years old (SD = 0.43); the gender distribution 
was even (48.89% girls). Among the participating students, 
21.23% had a migration background (MB); that is, they themselves 
or at least one of their parents had not been born in Germany. The 
most common languages MB students learned as children were (in 
descending order) Russian, Turkish, Italian, Albanian, Serbian, 
and Bosnian. None of the students in our sample were rated by 
teachers as having difficulties understanding spoken German. Ta- 
ble 1 provides additional information about the MB students. As 
the linguistic backgrounds of the students varied but all were fluent 
German speakers, we had no a priori expectations about the effect 
of the students’ migration status. 

In our quasi-experimental study (Gliner, Morgan, & Leech, 
2009), students in intact classrooms were recruited via the local 
Table 1 

Demographic Information by Treatment Condition 

                                                            Condition
Demographic information                         SRL + TEXT    TEXT          REG           Total
                                                (n = 229)     (n = 268)     (n = 266)     (n = 763)
Mean age in years (SD)                          9.89 (0.44)   9.80 (0.43)   9.74 (0.40)   9.80 (0.43)
Percentage of girls                             48.03         50.75         47.74         48.89
Percentage of MB students (overall)             38.86         8.58          18.80         21.23
Percentage of MB students who
  Were not born in Germany                      [illegible]   13.04         22.00         19.50
  Use German as their primary language at home  47.20         78.26         66.00         57.41
  Speak German at home at least sometimes       94.38         95.65         96.00         95.06

Note. SRL = self-regulated learning; TEXT = text reduction strategies; REG = regular classroom instruction; 
MB = migration background (the student and/or at least one of his or her parents were not born in Germany). 
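
The Total column of Table 1 can be cross-checked against the per-condition columns. The short sketch below does this for the overall percentage of MB students, assuming that the reported percentages correspond to whole-number head counts; the names `conditions` and `pooled_percentage` are hypothetical and used only for illustration.

```python
# (n, percentage of MB students) per condition, as reported in Table 1.
conditions = {
    "SRL + TEXT": (229, 38.86),
    "TEXT": (268, 8.58),
    "REG": (266, 18.80),
}


def pooled_percentage(groups):
    """Pool group percentages into one overall percentage.

    Head counts are recovered by rounding n * pct / 100 to the nearest
    integer, which assumes the reported percentages were computed from
    whole students.
    """
    total_n = sum(n for n, _ in groups.values())
    total_count = sum(round(n * pct / 100) for n, pct in groups.values())
    return total_n, total_count, 100 * total_count / total_n


total_n, mb_count, mb_pct = pooled_percentage(conditions)
print(total_n, mb_count, round(mb_pct, 2))  # 763 162 21.23
```

The pooled value of 21.23% agrees with the percentage of MB students reported in the Participants and Design section.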


education authorities, who also gave us permission to conduct this 
study. The local education authorities offered all fourth-grade 
teachers in their district the opportunity to participate in an eval- 
uation study of a classroom-based text-strategy training program 
as part of their professional development requirements. We 
semi-randomly assigned interested teachers, all of whom were certified 
elementary school teachers with at least 10 years of teaching 
experience, to the three conditions (two intervention conditions and 
one regular instruction condition). When more than one teacher 
was participating at the same school, we assigned these teachers 
either to the same intervention condition or to the regular instruc- 
tion condition, such that teachers were not aware that different 
versions of the program were being implemented. The teacher 
sample represented a total of 22 schools, with SRL + TEXT 
teachers distributed across eight of the schools and TEXT teachers 
distributed across nine of the schools. Four of the REG teachers 
taught in the same school as one of the SRL + TEXT teachers, and 
eight REG teachers taught in schools where there were no 
intervention-condition teachers. Teachers who were assigned to 
the regular instruction condition (under the pretense that we had a 
maximum number of participants and had raffled off the spots) 
were given the chance to receive the training materials after the 
evaluation study ended, and we promised them preferential admis- 
sion to future workshops. At the end of the study, we debriefed all 
teachers about the study design and offered them feedback on the 
results of students in their own classrooms. Teachers’ and stu- 
dents’ participation in the evaluation study was voluntary, and both 
participants and their parents consented to participation. Teachers 
also informed the students’ parents about the program. 

We implemented a pretest (Time 1, or T1), posttest (T2), and 
follow-up test design (T3) with three conditions: In the final 
sample, nine² classrooms participated in the full training condition, 
practicing both self-regulated learning and text reduction strategies 
(SRL + TEXT). Twelve classrooms participated in the text- 
strategy-only condition (TEXT). The students in this condition 
received the same training as the full training group, but without 
the specific self-regulation components of the training. Students 
from 12 additional classrooms received regular instruction (REG). 
Table 1 shows our sample’s demographic information by treatment 
condition. The evaluation of our study included two aspects: We 
conducted a summative evaluation for all three conditions at three 
measuring points with the help of standardized reading tests and 
questionnaires; we also carried out a process evaluation in the two 


training conditions to evaluate the students’ progress in finding 
main ideas in daily texts over the course of the training. 


Procedure 


Teacher workshops. Before implementing one of the two 
versions of the training program (SRL + TEXT or TEXT, see later 
detailed description), each group of intervention-condition teach- 
ers attended a workshop designed to prepare them for administer- 
ing their respective version of the training program in their class- 
rooms. As teachers were to conduct evaluations in their classrooms 
themselves, they also learned how to administer the measurement 
instruments. Teachers in the regular instruction condition (REG) 
only learned how to administer the measurement instruments. The 
workshops were held by the first two authors of this report. 

The 2-day workshop for the teachers in the SRL + TEXT 
condition covered theoretical information on text reduction strat- 
egies and self-regulated learning on the first day and the specific 
training program on the second day. Teachers received training 
materials for their students and discussed how they would admin- 
ister them in their classrooms. They also received a teachers’ 
manual documenting the concepts covered in the workshop and 
containing checklists of the materials to be covered on each day of 
the program (cf. Stoeger & Ziegler, 2008b). 

As teachers in the TEXT condition did not learn about self- 
regulated learning, their workshop lasted only 1 day. These teach- 
ers received exactly the same instruction and material on text 
reduction strategies as teachers in the SRL + TEXT condition. 

Instruction in the three intervention conditions. Instruction 
was delivered by fourth-grade classroom teachers during regular 
classroom hours. As the expository texts used in both intervention 
conditions dealt with topics from the natural sciences, the training 
was conducted mainly during reading instruction and instruction in 
basic science. The students in the regular instruction condition 
received a comparable amount of curriculum-based instruction in 
reading and basic science. As some classrooms with regular in- 
struction were from the same schools as the training classrooms, 
we asked teachers in these classrooms not to employ any of the 


² Originally, there were 12 participating classrooms in each of the three 
conditions. In the SRL + TEXT condition, three classroom teachers from 
one school decided on short notice and for reasons unrelated to the training 
program not to participate in the program. 
material provided by us for the training classrooms during the 
study, but to teach their students as they normally would. 

Teachers started administering their training program at dates 
scheduled shortly after the respective workshops. We provided all 
teachers of all three groups with contact information so that they 
could contact the first two authors of this report in the event that 
they were to have further questions regarding the implementation 
of the respective training program or the evaluation. Teachers in 
the intervention conditions could also contact their participating 
colleagues from the same group. Four (SRL + TEXT) or 3 weeks 
(TEXT) into the training program, we met with teachers in each of 
the intervention conditions in order to discuss practical issues of 
administering the program and to answer questions. 

Training program in the SRL + TEXT condition. Classroom 
teachers in the SRL + TEXT condition implemented a 7-week 
program in which students practiced text reduction strategies as an 
integral part of self-regulated learning exercises (Stoeger & 
Ziegler, 2008b). The program included daily activities for regular 
classroom instruction and for homework. By completing the pro- 
gram, the students systematically practiced all phases of the cycle 
of self-regulated learning described in the Introduction section. 

The training program consisted of 2 informational weeks and, 
thereafter, 5 learning-cycle weeks. During the informational 
weeks, students spent approximately 45-60 min of instruction 
time per day on the training program. During the learning-cycle 
weeks, the time spent on the training varied between approxi- 
mately 40 min on Tuesdays, Wednesdays, and Thursdays and 
approximately 60 min on Mondays and Fridays. 

During the first informational week, students learned why it is 
important to understand texts, what main ideas are, how they can 
identify them in expository texts, and how they can differentiate between 
main ideas and less important passages. Students received a one- 
page summary on how to identify main ideas; they were encour- 
aged to refer to this summary throughout the program whenever 
they felt the need to do so. Teachers also presented and modeled 
three reduction strategies that are useful for identifying and dis- 
playing main ideas: (a) underlining and copying main ideas ver- 
batim, (b) drawing a mind map containing main ideas, and (c) 
summarizing main ideas in one’s own words. Students received a 
one-page summary on each strategy and were given the opportu- 
nity to practice each strategy on a short expository text (approxi- 
mately 200-240 words). 

During the second informational week, teachers introduced the 
self-regulated learning cycle by Ziegler and Stoeger (2005). For 
the students, the cycle was called the learning circle and was 
illustrated with cartoon-style pictures of Zumpel the Mouse who 
described all seven phases of the circle as a first-person narrator. 
Using this instructional material, students created their own learn- 
ing circles and hung them up at home. Their hand-made learning- 
circle illustrations as well as the illustrations provided in the 
training program materials were meant to ensure that students 
would have frequent and easy access to visualizations of the 
learning circle and its individual phases while working through the 
training program. Teachers also used the second informational 
week to discuss the phases of self-regulated learning with their 
students; they used various examples drawn from everyday situa- 
tions such as completing homework or practicing a certain sports 
skill. At the end of the second informational week, teachers pro- 
vided their students with information on effective goal setting and 


discussed common goal-setting mistakes with their students. As 
students should become aware of the relationship between using 
learning strategies and achieving goals and as this is a very 
demanding task for fourth graders, we asked students to set rela- 
tively simple quantitative outcome goals. Finally, teachers in- 
formed their students about the structure of the training program 
planned for the upcoming weeks. 

During the following weeks, the learning-cycle weeks, the stu- 
dents repeatedly and consciously worked through all phases of the 
learning cycle. Every school day, students were to read an expos- 
itory text about a topic from the natural sciences (e.g., fungi and 
mushrooms; rainbows; desert plants; blood) and then to identify 
the 10 main ideas. The texts were designed especially for use in the 
training program and to adhere to the following criteria: Each text 
was about 420 words long and contained 10 main ideas as well as 
several distractor sentences (see online supplemental material for a 
sample text). All texts were of a comparable difficulty level. The 
texts received a mean score of 69.16 (SD = 3.73) on the German 
version of the Flesch readability index (Amstad, 1978), which 
corresponds to the difficulty level found in fifth-grade textbooks. 
These design criteria were set to ensure (a) that the texts would 
offer all students—including strong readers—the best possible 
chance of benefiting from applying, monitoring, and adjusting 
their strategy use and (b) that all students would be able to 
establish a clear connection between improved strategy use and 
better results. During the learning-cycle weeks, students kept a 
structured learning journal that accompanied them as they pro- 
gressed through the learning cycle. 
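
As a side note on the readability criterion mentioned above: the German adaptation of the Flesch index attributed to Amstad is commonly given as 180 minus the average sentence length (in words) minus 58.5 times the average number of syllables per word. The following minimal sketch computes that score from raw counts. The function name and the example counts are hypothetical and are not taken from the training texts; counting German syllables reliably is a separate problem that is not addressed here.

```python
def amstad_flesch(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """German Flesch reading ease (Amstad variant): 180 - ASL - 58.5 * ASW."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    asl = total_words / total_sentences    # average sentence length in words
    asw = total_syllables / total_words    # average number of syllables per word
    return 180.0 - asl - 58.5 * asw


# Hypothetical counts for a text of about 420 words (illustration only).
score = amstad_flesch(total_words=420, total_sentences=35, total_syllables=714)
print(round(score, 2))
```

Higher values indicate easier texts, so a mean near 69 with a small standard deviation is consistent with the statement that all texts were of comparable difficulty.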

At the beginning of each learning-cycle week, students set a 
specific outcome goal for themselves that specified how many 
main ideas (10 being the maximum) per daily text they aspired to 
find. The students were encouraged to set goals for themselves that 
were challenging but achievable. They noted their goals in their 
learning journal, and they also wrote down what strategy they 
planned to use in order to achieve their goal. During learning-cycle 
Weeks 1-3, one of the three previously introduced text strategies 
for identifying and displaying main ideas was prescribed by the 
program per week: underlining and copying verbatim for the first 
learning-cycle week, mind mapping for the second, and summa- 
rizing for the third. This way, all students had the opportunity to 
practice each strategy systematically. In the remaining 2 learning- 
cycle weeks (learning-cycle Weeks 4 and 5), students chose strategies 
that they felt had been particularly helpful during the previous 
weeks and/or strategies whose effective implementation they felt 
would profit from continued practice. During 
each of the 5 weeks, the students used their journals to keep track 
of how exactly they planned to use the strategy they were focusing 
on, how their monitoring worked, and what strategy adaptations 
they made. 

In order to help students in the SRL + TEXT classrooms better 
understand the text strategies introduced during the first training 
week in the context of self-regulated learning, we incorporated an 
additional, story-based reading activity into learning-cycle Weeks 
1, 2, and 3. We prepared four age-appropriate stories written in a 
more informal style in which the cartoon character Zumpel the 
Mouse served as a model of self-regulated learning use. In reading 
these “self-regulated learning stories,” the students accompany 
Zumpel as the mouse works on the aforementioned strategies: 
Zumpel self-assesses the learning process (Text 1) and then tries 
out, monitors, and adjusts the underlining strategy (Text 2), the 
mind mapping strategy (Text 3), and the summarizing strategy 
(Text 4). 

The students received one expository text per school day. They 
read the daily text silently and then had the opportunity to ask their 
peers and teacher about unfamiliar words. Then, before taking the 
text home and working further with it, they noted in their learning 
journal how many main ideas they thought they would find in that 
text (10 being the maximum number). At home, they used that 
week’s strategy to identify and display the main ideas in the text. 
Students spent between 20 and 30 min on this homework assign- 
ment. Right after having finished this part of their homework 
assignment, they evaluated how well their strategy worked on that 
day and wrote down in their learning journal how they wanted to 
improve their strategy use the next day. The next day, the home- 
work assignment was discussed in class. Teachers based this 
discussion on the sample solutions they had received as part of the 
teachers’ manual. The students noted in their learning journal how 
many of the main ideas they actually found. In a teacher-class 
dialogue, the teacher addressed the connection between strategy 
use and outcome. Students were encouraged to use their experi- 
ence with the text from the previous day to improve their strategy 
when working on the next text. 

Each Friday, Thursday’s homework assignment was discussed 
first. Then, the students worked on a new text during classroom 
instruction. After discussing results and strategy use for this new 
text, the teacher initiated a discussion about learning behavior, 
strategy use, and results in the current week. We integrated appro- 
priate prompts into the students’ learning journals to help facilitate 
this reflection process. The students thus also took time during 
classroom instruction on Fridays to summarize the current week in 
their journals. Based on this summary, teachers discussed the 
learning behavior with their students and how they could use their 
experience from this week to improve their learning behavior in 
the following week. 

Training program in the TEXT condition. Teachers in the 
TEXT condition used the same materials and methods as teachers 
in the SRL + TEXT condition with one exception: They did not 
employ the materials on or make explicit references to self- 
regulated learning. As the TEXT-condition teachers did not intro- 
duce the concept of self-regulated learning to their students (In- 
formational Week 2 in the SRL + TEXT condition), the duration 
of the TEXT-condition training program was reduced to 6 weeks. 
During an informational week, students in the TEXT condition 
learned—as did the students in the SRL + TEXT condition—why 
it is important to understand texts, what main ideas are, how they 
can identify them in expository texts, and how they can differen- 
tiate main ideas from less important passages. They also received 
the one-page summary on how to identify main ideas. Teachers 
in the TEXT condition also introduced and modeled the same three 
text reduction strategies used in the SRL + TEXT condition. 

Then, during the subsequent five practice weeks, students ap- 
plied the strategies to one expository text per school day by 
working to identify the 10 main ideas within each text. As in the 
SRL + TEXT condition, teachers discussed the correct solutions 
of this homework assignment with their students. However, stu- 
dents were not encouraged to use any self-regulated learning 
strategies. Students in the TEXT condition neither read self- 


regulated learning stories nor kept learning journals. Table 2 shows 
the two intervention conditions in comparison. 

Instruction in the REG condition. Students in the REG con- 
dition received regular classroom instruction in reading and basic 
science in accordance with the curriculum. The curriculum explic- 
itly lists the use of text strategies such as underlining, making 
graphic representations, and summarizing as part of the reading 
instruction and summarizing basic scientific texts as part of the 
basic science instruction. Moreover, the legally binding state cur- 
riculum of the region where the study was conducted explicitly 
encourages teachers to emphasize self-regulated learning as the 
basis for lifelong learning and as a means of transferring more 
responsibility for the learning process onto the students. Within the 
confines of the curriculum, teachers in the regular instruction 
conditions could adjust their teaching to the needs of their students. 
Students spent between 20 and 30 min on their reading and basic 
science homework assignments each day. 


Table 2 
Two Intervention Conditions in Comparison 

SRL + TEXT                                          TEXT

Informational weeks                                 Informational weeks
  Week 1                                              Week 1
    Why understand texts                                -
    How to find main ideas                              -
    How to use text reduction strategies                -
  Week 2                                              Week 2
    Self-regulated learning                             Why understand texts
    -                                                   How to find main ideas
    -                                                   How to use text reduction strategies

Learning-cycle weeks                                Practice weeks
  Daily tasks                                         Daily tasks
    Reading                                             Reading
      Read text                                           Read text
      Use text reduction strategy                         Use text reduction strategy
      Find 10 main ideas                                  Find 10 main ideas
    SRL
      Self-assessment
      Strategy monitoring
      Strategy adjustment
      Outcome evaluation
  Weekly tasks (Weeks 3-7)
    SRL
      Goal setting
      Strategic planning
      Outcome evaluation (reflection)
  Week 3                                              Week 3
    Underlining                                         Underlining
    SRL stories                                         -
  Week 4                                              Week 4
    Mind mapping                                        Mind mapping
    SRL story                                           -
  Week 5                                              Week 5
    Summarizing                                         Summarizing
    SRL story                                           -
  Week 6                                              Week 6
    Applying a text reduction strategy of choice        Applying a text reduction strategy of choice
  Week 7                                              Week 7
    Applying a text reduction strategy of choice        Applying a text reduction strategy of choice

Note. SRL = self-regulated learning; TEXT = text reduction strategies. 

Treatment fidelity. All teachers indicated in their checklists 
that they used all training materials with the exception of one 
teacher who skipped one text due to a school activity day. The 
student training materials, which we collected from the students 
after the training programs were over, also suggest that the training 
programs were delivered as intended. Missing data in the student 
materials are in the range we expected for a practice period of 5 
consecutive weeks. From this evidence and from more information 
collected in personal communication with teachers, we concluded 
that the interventions were implemented as intended. 

Program evaluation. The testing sessions for the summative 
evaluation were scheduled during regular classroom hours in the 
week before the training started (T1), in the week after it con- 
cluded (T2), and another 11 weeks later (T3). The sessions were 
led by trained research assistants or by the specially trained class- 
room teachers. 

At T1, students filled out the questionnaire on their preference 
for self-regulated learning during one 35-min testing session and 
completed the reading comprehension test and questions on de- 
mographic information in another testing session that lasted 25 
min. At T2 and T3, the testing sessions lasted 35 min (question- 
naire) and 75 min (reading comprehension test), respectively. To 
ensure comparable testing conditions, teachers and research assis- 
tants followed a detailed manual and read out instructions verba- 
tim. The instrument for the process evaluation was integrated into 
the training material in both intervention conditions. An overview 
of our measurement schedule is provided as online supplemental 
material, Table S1. 


Measures 


Measures used in the summative evaluation. We measured 
the preference for self-regulated learning at T1, T2, and T3 with 
the 28 items of the Fragebogen Selbstreguliertes Lernen-7, or 
FSL-7 [Questionnaire of Self-Regulated Learning-7] by Ziegler, 
Stoeger, and Grassinger (2010). The FSL-7 is based on Ziegler’s 
and Stoeger’s (2005) seven-step cyclical model of self-regulated 
learning. In the questionnaire, four school-relevant situations are 
described briefly: studying for school, preparing for the upcoming 
school year during the summer holidays, preparing for an in-class 
test, and catching up on school work after an illness. In each 
situation, the students are asked to indicate their preferred method 
of learning for each of the seven steps of self-regulated learning 
(self-assessment, goal setting, strategic planning, strategy imple- 
mentation, strategy monitoring, strategy adjustment, and outcome 
evaluation) by choosing one of three alternatives: self-regulated, 
externally regulated, or impulsive learning. The following is a 
sample item (Situation 1, Step 2: Goal setting): 


How do you study for school? (a) I set a fixed goal for myself 
describing what and how much I want to study [self-regulated 
learning]; (b) My teacher or parents should tell me which goal I should set 
for myself [externally regulated learning]; (c) When studying, I don’t 
set a specific goal for myself. I can rely on my intuition [impulsive 
learning behavior]. 


In the present study, the research assistant or the classroom teacher 
read the four situations and the response alternatives out loud, 
ensuring that everyone, including weak readers, could complete 
the questionnaire both accurately and quickly. 


In the present study, we restricted our interest to the preference 
for self-regulated learning. Therefore, we calculated an overall 
score by counting the frequency with which a student chose 
self-regulated learning and divided it by the number of items 
answered. For ease of understanding, the scores are reported as 
percentages. For example, a student who chose the self-regulated 
learning option in 13 out of the 28 items would be given a score of 
46.43%. The internal consistency came to .85 at T1, .91 at T2, and 
[illegible] at T3. 
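
The scoring rule described above (number of self-regulated choices divided by the number of items answered, reported as a percentage) can be sketched as follows. The response labels and the function name are hypothetical; unanswered items are represented by `None`, which is an assumption about how missing responses might be coded.

```python
from typing import Optional, Sequence


def srl_preference_score(responses: Sequence[Optional[str]]) -> float:
    """Percentage of answered items on which the self-regulated option was chosen."""
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("no items answered")
    n_self_regulated = sum(1 for r in answered if r == "self_regulated")
    return 100.0 * n_self_regulated / len(answered)


# The worked example from the text: 13 self-regulated choices out of 28 items.
responses = ["self_regulated"] * 13 + ["externally_regulated"] * 9 + ["impulsive"] * 6
print(f"{srl_preference_score(responses):.2f}%")  # 46.43%
```

Dividing by the number of items answered, rather than by 28, keeps scores comparable for students who skipped individual items.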

At T1, we measured reading comprehension with the text comprehension 
section of the Ein Lesetest für Erst- bis Sechstklässler, 
or ELFE 1-6 [Reading Test for First to Sixth Graders], by 
Lenhard and Schneider (2006). In this section of the ELFE 1-6, 
students have 7 min to read 13 short texts (15-56 words) and 
answer a total of 20 multiple-choice questions. According to the 
authors of the test, students require different levels of reading skills 
to answer the questions. The skills are: finding information (five 
items), intersentential reading (eight items), and inferential reading 
(seven items). For the purpose of this study, we calculated the 
overall reading score (range: 0-20 points). Cronbach’s alpha 
came to .82 in our sample. 
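
For readers who want to reproduce an internal-consistency estimate of this kind, the standard formula for Cronbach's alpha is k/(k - 1) times (1 minus the sum of the item variances divided by the variance of the total score). The sketch below applies it to a small, invented item matrix; the data and the function name are hypothetical and unrelated to the study sample.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-student rows, one numeric entry per item."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("at least two items are required")
    # Sample variance of each item across students.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the students' total scores.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented 0/1 item scores for five students on four items (illustration only).
example = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
print(round(cronbach_alpha(example), 2))
```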

We had originally planned to use the ELFE test at T2 and T3 as 
well. However, as we encountered unexpected ceiling effects at T1 
(39.20% of all students had a score of at least 18 out of 20 points, 
and 11.33% of all students had a perfect score of 20 points), we 
decided to use a different, more difficult test at T2 and T3 that was 
designed to assess similar aspects of reading and text comprehension. 
We employed the text comprehension section of the Hamburger 
Lesetest für 3. und 4. Klassen, or HAMLET 3-4 [Hamburg 
Reading Comprehension Test for Grades 3 and 4] by Lehmann, 
Peek, and Poerschke (2006), using Version A at T2 and Version B 
at T3. Time constraints prevented us from employing both the 
ELFE and the HAMLET. 

The text comprehension section of each HAMLET version 
comprises 10 texts: five expository texts, three so-called functional 
texts (e.g., recipes and timetables), and two narrative texts; the text 
length varies between 57 and 592 words. The test was administered 
in two parts: Students had 25 min to work on Texts 1-4, and 
after a 5-min break, they had another 40 min to work on Texts 
5-10. Students were asked to answer four multiple-choice ques- 
tions per text. According to the test’s authors, students require 
different levels of reading skills to answer the questions. The skills 
are: simple finding of information (nine items in Version A, eight 
items in Version B), targeted finding of information (nine items in 
Version A, 10 items in Version B), combining/reconstructing (14 
items in both Versions A and B), and connecting/inferring (eight 
items in both Versions A and B). For the purpose of this study, we 
calculated the overall reading score (range: 0-40 points). Cronbach’s 
alpha came to .90 at T2 and .92 at T3 in our sample. 

Measure used in the process evaluation. Students in both the 
SRL + TEXT and the TEXT conditions were asked to identify the 10 
main ideas in each of the 25 texts provided throughout the course of 
the training program (see previous section “Training program in 
the SRL + TEXT condition” for details). After the end of training, 
we collected all of the students’ training materials. Trained re- 
search assistants checked the number of correctly identified main 
ideas in each text (range: 0-10 main ideas), using a list of the 
correct main ideas for each text as a reference. After completing 
this rating process, we returned the training materials to the students. 
As a measure for the process evaluation, we used the weekly 
average of the number of correctly identified main ideas, resulting 
in five values per student. 
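
The aggregation step described above can be sketched as follows: 25 daily counts (0-10 each) are collapsed into five weekly means. The function name is hypothetical. Skipping missing texts within a week is a simplification made only for this illustration; the study itself handled missing values through multiple imputation.

```python
def weekly_averages(daily_counts, texts_per_week=5):
    """Collapse daily counts of correctly identified main ideas into weekly means.

    Entries are integers from 0 to 10, or None when a text is missing.
    A week with no available texts yields None.
    """
    weekly = []
    for start in range(0, len(daily_counts), texts_per_week):
        week = [c for c in daily_counts[start:start + texts_per_week] if c is not None]
        weekly.append(sum(week) / len(week) if week else None)
    return weekly


# Invented record for one student: 25 texts, one of them missing.
counts = [6] * 5 + [7] * 5 + [8, 8, 8, 8, None] + [7, 8, 9, 8, 8] + [10, 9, 10, 9, 10]
print(weekly_averages(counts))
```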


Sample Drop-Out and Missing Data 


In terms of the summative evaluation, 13 students (1.7% of the 
sample) missed the reading test at T1, 26 (3.4%) at T2, and 30 
(3.9%) at T3. Of these students, one missed the reading test both 
at T1 and T3, and seven missed the reading test both at T2 and T3. 
Eleven students (1.4% of the sample) missed the questionnaire 
about the preference for self-regulated learning at T1, 25 (3.4%) at 
T2, and 36 (4.7%) at T3. Of these students, one missed the 
questionnaire at both T1 and T3, and five both at T2 and T3. 

To handle missing data appropriately, we used state-of-the-art 
methods (cf. Graham, 2009; Schafer & Graham, 2002). As the 
program that we chose for the inferential analyses for the 
summative evaluation, HLM (Hierarchical Linear and Nonlinear 
Modeling software) Version 6.08 (Raudenbush, Bryk, Cheong, & 
Congdon, 2011), applies listwise deletion methods even if the 
full-information maximum-likelihood estimation (FIML) is chosen 
for regular two-level analyses, we used multiply imputed data sets 
for all inferential analyses with HLM (for all details regarding 
HLM analyses, see section “Overview of Statistical Procedures” in 
Results). A discussion of methods for multiple imputation of 
multilevel data is beyond the scope of this article but can be found, 
for example, in van Buuren (2011). We used the WinMICE soft- 
ware (Jacobusse, 2005) to generate five data sets under the hier- 
archical linear model. WinMICE makes use of a nested Gibbs 
sampler to estimate the parameters of the multilevel model for 
individual variables. We then analyzed the five sets simultaneously 
with HLM. 
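
Analyzing several imputed data sets "simultaneously" usually means estimating the model once per data set and then pooling the results. A minimal sketch of the standard pooling rules (Rubin's rules) is given below; it illustrates the general technique and is not a description of HLM's internal implementation. The function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, standard_errors):
    """Pool one coefficient across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need matching estimates and standard errors from at least two data sets")
    pooled_estimate = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])  # average sampling variance
    between = variance(estimates)                        # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)


# Invented coefficients and standard errors from five imputed data sets.
estimate, se = pool_rubin([0.20, 0.30, 0.25, 0.35, 0.15], [0.10] * 5)
print(round(estimate, 3), round(se, 3))
```

The between-imputation term inflates the pooled standard error, so uncertainty due to the missing values is carried into the inferential tests.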

We received training materials from 476 of the 497 students in 
both training groups (221 SRL + TEXT, 255 TEXT) for use in the 
process evaluation; 233 students completed all texts, 61 students 
missed only one text, 157 students missed two to seven texts, and 
25 students missed eight to 13 texts. Data of all students were 
included in further analyses. In terms of the different texts, there 
were between 2.9% and 22.1% missing data per text, with missing 
data below 10% for the first 18 texts and over 20% for only one of 
the texts in the final week of the training. To ensure consistency 
with our summative evaluation, we multiply imputed the missing 
data for the number of correctly identified main ideas with the 


WinMICE software. We then analyzed the five imputed data sets 
simultaneously with HLM. 


Results 


Descriptives and Zero-Order Correlations 


Table 3 shows descriptive statistics, proportions of between- 
classroom variance (the intraclass correlation, or ICC), and bivari- 
ate Pearson correlations for all measures used in the summative 
evaluation. The ICC indicates “the proportion of variance in the 
outcome that is between groups” (Raudenbush & Bryk, 2002, p. 
36) rather than between individuals. Students chose self-regulated 
learning as their preferred approach to learning for slightly more 
than one third of all FSL-7 items. The rather large standard 
deviation indicates large differences between students. Right after 
the training, a small portion (7.57%) of the variance was located 
between classrooms, rising to a medium portion (16.23%) at 
follow-up. Students scored on average 15.57 points in the ELFE 
reading comprehension test, which is slightly higher than fourth 
graders in the norm sample (cf. Lenhard & Schneider, 2006). In the 
HAMLET reading comprehension tests, students also scored 
slightly better than students in the norm sample (cf. Lehmann et 
al., 2006). For both data-collection points, the proportion of vari- 
ance between classrooms in the HAMLET was small (7.48% and 
3.78%). The measures of self-regulated learning at different data- 
collection points were correlated, as were the measures of reading 
comprehension at different data-collection points; the preferences 
for self-regulated learning and reading comprehension were not 
correlated. Table 4 contains means and standard deviations for the 
dependent variables, listed separately for each condition and data- 
collection point. Both original values and z-transformed values are 
presented. 
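
The ICC values reported in this section follow from the variance components of an unconditional two-level model: the between-classroom variance divided by the sum of the between-classroom and within-classroom variances. A minimal sketch is shown below. The variance components are hypothetical and were chosen only so that the result matches the 7.57% reported for the preference measure at T2; the actual components are not given in the text.

```python
def intraclass_correlation(between_variance: float, within_variance: float) -> float:
    """Proportion of outcome variance that lies between groups (here, classrooms)."""
    total = between_variance + within_variance
    if between_variance < 0 or within_variance < 0 or total <= 0:
        raise ValueError("variance components must be non-negative with a positive sum")
    return between_variance / total


# Hypothetical components for a standardized outcome (total variance of 1).
icc = intraclass_correlation(between_variance=0.0757, within_variance=0.9243)
print(f"{100 * icc:.2f}%")  # 7.57%
```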

Table 5 contains descriptive statistics for the process evaluation. 
Both training groups started with slightly more than six correctly 
identified main ideas on average in the first week, with a slight 
advantage for the students in the TEXT condition. Over the course 
of the training program, students in the SRL + TEXT condition 
increased the number of correctly identified main ideas from week 
to week, suggesting a linear increase, whereas the number of 
correctly identified main ideas seems to remain rather constant in 
the TEXT condition. 











Table 3 
Descriptive Statistics, Proportions of Between-Classroom Variance, and Bivariate Pearson Correlations 

Variable                                          Scale      M          SD  ICC (%)           1       2       3           4       5
1. Preference for self-regulated learning (T1)    0-100  35.99       20.89        -           -
2. Preference for self-regulated learning (T2)    0-100  38.17       25.66     7.57 [illegible]       -
3. Preference for self-regulated learning (T3)    0-100  39.64       28.86    16.23       .49**   .68**       -
4. Reading comprehension (T1, ELFE)                0-20  15.57 [illegible]        -         .06     .06     .05           -
5. Reading comprehension (T2, HAMLET A)            0-40  28.36        6.03     7.48         .03    -.00     .03 [illegible]       -
6. Reading comprehension (T3, HAMLET B)            0-40  29.54        5.68     3.78 [illegible]     .02     .04 [illegible]   .68**

Note. ICC = intraclass correlation; T1 = Time 1; ELFE = Ein Lesetest für Erst- bis Sechstklässler [Reading Test for First to Sixth Graders; Lenhard 
& Schneider, 2006]; HAMLET = Hamburger Lesetest für 3. und 4. Klassen [Hamburg Reading Comprehension Test for Grades 3 and 4; Lehmann, Peek, 
& Poerschke, 2006]. 
**p < .01, two-tailed. 

Table 4 
Descriptive Statistics per Condition and Point of Measurement 

                          Time 1                    Time 2                    Time 3
Condition           n            M      SD    n            M      SD    n            M      SD

Original values
  Preference for self-regulated learning
    SRL + TEXT    228        38.15   21.53  222        44.47   27.79  216        49.43   30.49
    TEXT          261        36.64   21.67  257        37.70   24.48  253        38.71   27.98
    REG           263        33.48   19.30  259        33.14   23.74  258        32.34   25.94
  Reading comprehension
    SRL + TEXT    225        15.42    3.76  218  [illegible]    6.41  213        28.92    6.10
    TEXT          262        15.38    3.46  262        28.59    5.93  261        29.75    5.89
    REG           263        15.90    3.34  257        28.58    5.80  259        29.83    5.04

z-transformed values
  Preference for self-regulated learning
    SRL + TEXT    228         0.10    1.03  222         0.25    1.08  216         0.34    1.06
    TEXT          261         0.03    1.04  257        -0.02    0.95  253        -0.03    0.97
    REG           263  [illegible]    0.92  259        -0.20    0.93  258  [illegible]    0.90
  Reading comprehension
    SRL + TEXT    225        -0.04    1.07  218        -0.08    1.06  213  [illegible]    1.07
    TEXT          262        -0.05    0.98  262         0.04    0.98  261         0.04    1.04
    REG           263         0.09    0.95  257         0.04    0.96  259         0.05    0.89

Note. SRL = self-regulated learning; TEXT = text reduction strategies; REG = regular classroom instruction. 


Preliminary Analyses 


We used chi-square tests (for percentage data) and univariate 
analyses of variance (to compare means) to examine whether the 
three groups were comparable with regard to their demographic 
composition and their pretest scores. The groups did not differ 
significantly in their gender distribution (p = .75; SRL + TEXT 
48.03%, TEXT 50.75%, REG 47.74% female) but differed significantly 
in the proportion of students with migration background 
(MB; p = .00; SRL + TEXT 38.86%, TEXT 8.58%, REG 
18.80%; using the Bonferroni correction, all three pairwise 
comparisons showed significant differences) and with regard to the 
students’ mean age (p = .00; SRL + TEXT: 9.89 years; TEXT: 
9.80 years; REG: 9.74 years; the Bonferroni post hoc test showed 
that only SRL + TEXT differed significantly from REG, but 
TEXT did not differ from the other two conditions). The groups 
differed in terms of preference for self-regulated learning (p = .04; 
again the only significant difference in the Bonferroni post hoc test 
was between SRL + TEXT and REG) but did not differ significantly 
in terms of reading comprehension pretest scores (p = .18; 
for means and standard deviations, cf. Table 4). To control for 
these individual variables, we included them as covariates in all 
inferential analyses. 


Table 5 
Descriptive Statistics for Number of Correctly Identified Main Ideas per Condition and Week 

                    SRL + TEXT                         TEXT
Week          n            M           SD      n            M           SD

Original values
Week 1      220         6.05         1.78    255         6.24         1.54
Week 2      219         6.10  [illegible]    253         5.58         1.63
Week 3      218  [illegible]         2.07    255         6.57         1.58
Week 4      218         6.72  [illegible]    251         6.19  [illegible]
Week 5      220         7.30         1.78    244         6.44         1.72

z-transformed values
Week 1      220        -0.06         1.07    255         0.05         0.93
Week 2      219         0.16         1.03    253        -0.14         0.95
Week 3      218         0.04         1.13    255        -0.04         0.87
Week 4      218         0.15         1.07    251  [illegible]         0.92
Week 5      220         0.25         0.99    244  [illegible]         0.96

Note. SRL = self-regulated learning; TEXT = text reduction strategies. 


Overview of Statistical Procedure 


As we recruited students in intact classrooms for this study, we 
used hierarchical linear models (Raudenbush & Bryk, 2002) to 
analyze our data. This method takes into account the fact that 
students within a classroom are more similar to each other than are 
randomly selected students and estimates standard errors associated 
with the regression coefficients in an appropriate way. We 
conducted all analyses with the software package HLM (Version 
6.08; Raudenbush et al., 2011), using the FIML algorithm for 
model estimations. 

Summative evaluation. For the summative evaluation, we 
were interested in assessing the program’s effects on the students’ 
preference for self-regulated learning and on their reading 
comprehension immediately after the training and 11 weeks later. 
Therefore, we specified four sets of models (two dependent vari- 
ables X two time points) in which we regressed the respective 
individual outcome variable on individual (Level 1) and classroom 
(Level 2) predictors. As all students from one classroom were in 
the same training condition, we specified the training condition on 
the classroom level. 

We calculated four models for each set. We used the uncondi- 
tional model (Model 0) to calculate the intraclass correlation (ICC) 
of the outcome variable. In Model 1, we included all Level-1 
covariates, namely, gender, age, MB, pretest score in the prefer- 
ence for self-regulated learning, and pretest score in reading com- 
prehension. In a first step, we allowed intercepts and slopes to vary 
between classrooms. In a second step, we fixed slopes if they did 
not vary significantly (p < .10) between classrooms or if the 
reliability of the variable was less than .10 (cf. Cheung & Keeves, 
1990). This model served as a reference model for the other two 
models, which include Level-2 variables. In Model 2, we calcu- 
lated the effects of the two training conditions after controlling for 
individual variables. To this end, we included two dummy vari- 
ables, SRL + TEXT and TEXT, to specify a classroom’s adher- 
ence to a certain treatment condition, making the REG group the 
reference group. Again, we let slopes vary freely first and fixed 
them if they did not vary between classrooms or if reliability was 
low. We calculated Model 3 to account for the fact that classrooms 
and training conditions differed substantially in their proportions 
of MB students. For this reason, we added the proportion of MB 
students as a Level-2 covariate. Model 3 shows the training effects 
for classrooms with an average proportion of MB students. 
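
For orientation, the model sequence described above can be written in the two-level notation of Raudenbush and Bryk (2002). The following is a sketch of Model 3 under stated assumptions, not the authors' exact specification: the variable names (Female, Age, MB, SRL at T1, Read at T1, PropMB) and coefficient symbols are chosen here for illustration, the classroom proportion of MB students is assumed to be grand-mean centered, and the treatment dummies are shown on the intercept only.

```latex
% Level 1: student i in classroom j
Y_{ij} = \beta_{0j} + \beta_{1j}\,\mathrm{Female}_{ij} + \beta_{2j}\,\mathrm{Age}_{ij}
       + \beta_{3j}\,\mathrm{MB}_{ij} + \beta_{4j}\,\mathrm{SRL}^{T1}_{ij}
       + \beta_{5j}\,\mathrm{Read}^{T1}_{ij} + r_{ij},
       \qquad r_{ij} \sim N(0, \sigma^{2})

% Level 2: classroom j, with REG as the reference group
\beta_{0j} = \gamma_{00} + \gamma_{01}\,\mathrm{SRLTEXT}_{j} + \gamma_{02}\,\mathrm{TEXT}_{j}
           + \gamma_{03}\,\mathrm{PropMB}_{j} + u_{0j},
           \qquad u_{0j} \sim N(0, \tau_{00})

% Slopes are fixed unless they vary between classrooms
\beta_{pj} = \gamma_{p0}, \qquad p = 1, \dots, 5
```

In this notation, Model 2 is the special case without the PropMB term, and the coefficients of the two dummy variables express the adjusted differences between each training condition and the REG group.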

We report fixed effects based on model estimation with robust 
standard errors. As the software package HLM Version 6.08 does 
not provide standardized beta coefficients, we standardized all 
continuous variables (measures of reading comprehension and 
preference for self-regulated learning) before entering them into 
the models for easier interpretation of effects. The intercept of the 
regression equations can now be interpreted as the mean for a male 
student without MB who is of average age and with an average 
preference for self-regulated learning and an average reading 
score; and the slope coefficients show by how much the dependent 
variable changes in terms of proportions of a standard deviation if 
a predictor changes by one unit. 
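
The standardization step can be sketched as follows. Whether the sample or the population standard deviation was used is not stated in the text; the sketch assumes the sample standard deviation, and the function name and data are hypothetical.

```python
from statistics import mean, stdev


def z_standardize(values):
    """Return z scores: (value - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: sample SD (n - 1 in the denominator)
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - m) / s for v in values]


# Invented reading scores for five students (illustration only).
scores = [22.0, 28.0, 31.0, 25.0, 34.0]
print([round(z, 2) for z in z_standardize(scores)])
```

After this transformation, a coefficient of, say, 0.30 for a standardized predictor means that the outcome changes by 0.30 standard deviations per standard deviation of the predictor, while coefficients of dummy variables give group differences in standard deviation units of the outcome.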

Process evaluation. The process evaluation was conducted to 
examine whether students in both training conditions identified an 
increasing number of main ideas across the course of the training. 
We assumed that students in both training conditions would be- 
come more proficient as the training progressed and that the 
increase in the number of correctly identified main ideas would be 
greater for the students in the SRL + TEXT condition. After 
inspecting the descriptive statistics for both groups, we used a 
linear growth model (cf. Raudenbush & Bryk, 2002) to predict the 
weekly average number of correctly identified main ideas with the 
five time points per student on Level 1, students on Level 2, and 
classrooms on Level 3. We used the original metric (number of 
correctly identified main ideas) for the outcome variable (as op- 
posed to z-standardized values) to allow for the modeling of actual 
growth. All continuous covariates were z-standardized. We coded 
the time points from 0 (Week 1) to 4 (Week 5) so that a value 
of 0 on the linear time variable corresponds to the initial status, that is, 
the average number of correctly identified main ideas during the 
first of the five learning-cycle weeks. The weekly growth rate 
(slope) is indicated by the value for the linear time parameter. In 
a manner similar to our approach in the summative evaluation, we 
modeled student characteristics as covariates on the student level 
(here, Level 2) and the training condition as a dummy variable on 
the classroom level (here, Level 3). 
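
One way to write out the three-level growth model described here is sketched below. The notation follows common multilevel conventions and is an assumption, as is the symbol D_j for the condition dummy; the text itself gives no equations. Week is coded 0 to 4, and the X variables stand for the student-level covariates.

```latex
% Level 1 (time points within students)
Y_{tij} = \pi_{0ij} + \pi_{1ij}\,\mathrm{Week}_{tij} + e_{tij}

% Level 2 (students within classrooms); X_{qij} are the student-level covariates
\pi_{0ij} = \beta_{00j} + \sum_{q} \beta_{0qj} X_{qij} + r_{0ij}
\pi_{1ij} = \beta_{10j} + \sum_{q} \beta_{1qj} X_{qij} + r_{1ij}

% Level 3 (classrooms); D_j = 1 for SRL + TEXT, 0 for TEXT
\beta_{00j} = \gamma_{000} + \gamma_{001} D_j + u_{00j}
\beta_{10j} = \gamma_{100} + \gamma_{101} D_j + u_{10j}
```

In this form, the intercept of the initial-status equation is the expected number of main ideas in Week 1 for the reference group, and the coefficient on D_j in the slope equation is the difference in weekly growth between the two training conditions.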

Also in a manner similar to the procedure in the summative 
analysis, we calculated four models. From the unconditional model 
(Model 0), we took estimates for the variance components in both 
the intercept (initial status; number of correctly identified main 
ideas) and the slope (weekly increase in the number of correctly 
identified main ideas). In Model 1, we included all student-level 
covariates, namely, gender, age, MB, pretest score in the prefer- 
ence for self-regulated learning, and pretest score in reading com- 
prehension, both for the intercept and the slope. We allowed all 
covariates to vary between classrooms in a first step, but fixed 
them using the same criteria as in the summative evaluation in a 
second step. We also used this procedure for the two remaining 
models. Model 1 served as the reference model for Models 2 and 
3, in which we included classroom variables. In Model 2, we 
included training condition as predictor on the classroom level. As 
we compared only the two training conditions with each other in 
this analysis, one dummy variable (SRL + TEXT) was sufficient. 
Thus, the TEXT group became the reference group in this analysis. 
Finally, we also controlled for the proportion of MB students per 
classroom in Model 3. 


Summative Training Effects 


We present the results of the hierarchical regression analyses in 
two sections. First, we describe the results regarding the students’ 
preference for self-regulated learning, both right after the training 
and 11 weeks later. Second, we present the results regarding the 
students’ reading comprehension, again for both the posttest and 
the follow-up test. 

Preference for self-regulated learning. The results of the 
two-level analyses of the preference for self-regulated learning are 
presented in Table 6, with posttest results in the upper half and 
follow-up results in the lower half. Model 1 serves as a reference 
model and contains only individual input variables. As expected, 
the preference for self-regulated learning at T1 is a strong predictor 
of the preference for self-regulated learning at both T2 and T3. 
In addition, there is a trend indicating that girls’ preference for 
self-regulated learning generally increased more from T1 to T2 
than that of boys. We did not, however, find the same effect at T3. 
At T3, the preference for self-regulated learning was instead 
slightly higher for older than for younger students. When we 
introduced classroom-level predictors in Models 2 and 3, the 
values of the individual predictor variables remained roughly the 
same. In Model 2, we found a small effect of the combined training 
condition (SRL + TEXT) on the students’ preference for self- 
regulated learning at the posttest and a medium effect at follow-up. 
As expected, there were no significant training effects on the 
preference for self-regulated learning for the TEXT condition. 
Inclusion of the training conditions as predictors in the model 
explained almost 40% of the classroom-level variance in Model 1 
for the posttest and almost 35% for the follow-up test. Controlling 
for the proportion of MB students per classroom in Model 3 
enhanced the effects of the SRL + TEXT training condition both 
at posttest and follow-up. The proportion of explained classroom- 
level variance rose to over 45% at the posttest and to over 41% at 
the follow-up test. 

Table 6 
Results of the Two-Level Analyses for the Preference for Self-Regulated Learning 

Variable                    Model 1               Model 2               Model 3
                            B (SE)                B (SE)                B (SE)
SRL at posttest (Time 2)
  Intercept                 [illegible] (0.06)    −0.16 (0.08)          −0.17 (0.08)
Level 1
  Pretest SRL               0.60* (0.03)          0.60* (0.03)          0.60* (0.03)
  Pretest reading           0.04 (0.03)           0.03 (0.03)           0.03 (0.03)
  Gender female             0.09 (0.06)           0.09† (0.06)          0.09† (0.06)
  Migration                 0.00 (0.09)           −0.04 (0.09)          −0.02 (0.09)
  Age                       0.03 (0.03)           0.02 (0.03)           0.02 (0.03)
Level 2
  SRL + TEXT                                      [illegible] (0.11)    0.39* (0.13)
  TEXT                                            0.06 (0.10)           0.03 (0.10)
  Migration (agg.)                                                      [illegible] (0.05)
R² Level 1                  37.53%                37.52%                37.52%
R² Level 2                  –                     39.47%                45.18%
Deviance                    1778.00               1769.09               1767.37
SRL at follow-up (Time 3)
  Intercept                 −0.04 (0.07)          −0.25 (0.16)          [illegible] (0.11)
Level 1
  Pretest SRL               0.46* (0.03)          0.46* (0.03)          0.46* (0.03)
  Pretest reading           0.04 (0.03)           0.03 (0.03)           0.03 (0.03)
  Gender female             0.07 (0.06)           0.08 (0.06)           0.08 (0.07)
  Migration                 −0.04 (0.07)          −0.08 (0.08)          −0.04 (0.08)
  Age                       0.10* (0.03)          0.09* (0.03)          0.09* (0.03)
Level 2
  SRL + TEXT                                      [illegible] (0.15)    0.68* (0.15)
  TEXT                                            0.18 (0.15)           0.11 (0.14)
  Migration (agg.)                                                      [illegible] (0.07)
R² Level 1                  25.32%                [illegible]           25.17%
R² Level 2                  –                     34.37%                41.17%
Deviance                    1871.00               1859.91               1856.32

Note. N = 763 students from 33 classrooms. Values for Level-1 variables are set in italics if slopes varied freely between classrooms. SRL = 
self-regulated learning; TEXT = text reduction strategies; agg. = aggregated (the proportion of migration background students per classroom was 
aggregated from individual student data on migration background status). Variance-explained statistics were computed from the variance components with 
the following equations: R²(Level 1) = (σ²[unconditional model] − σ²[fitted model]) / σ²[unconditional model]. R²(Level 2) = (τ00[Model 1] − τ00[Model with 
Level-2 variables]) / τ00[Model 1]. 
†p < .10. *p < .05. 
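
The classroom-level variance explained that is reported for Models 2 and 3 is a proportional reduction in a variance component relative to Model 1. The following Python sketch shows that calculation; the function name and the two variance values are hypothetical and are not taken from the study.

```python
def variance_explained(var_reference, var_fitted):
    """Proportional reduction of a variance component relative to a reference model."""
    if var_reference <= 0:
        raise ValueError("reference variance must be positive")
    return (var_reference - var_fitted) / var_reference


# Hypothetical intercept variances (tau_00), not values from the study.
tau_model_1 = 0.050  # reference model with individual predictors only
tau_model_2 = 0.030  # model that adds the training-condition dummies
r2_level_2 = variance_explained(tau_model_1, tau_model_2)
```

With these illustrative numbers, the classroom-level predictors would account for about 40% of the between-classroom variance that remained in the reference model.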

Reading comprehension. Table 7 shows the results of the 
two-level analyses for reading comprehension. In Model 1, the 
pretest reading comprehension scores strongly predict reading 
comprehension scores both at the posttest and at the follow-up. By 
contrast, MB students scored significantly and considerably lower 
on the reading comprehension test at the posttest and at the 
follow-up, even though the pretest scores were controlled. In 
addition, younger students scored slightly better both at the post- 
test and at the follow-up test. Finally, girls achieved slightly better 
reading scores than boys at the follow-up. The values of the 
individual predictor variables changed very little when we intro- 
duced classroom-level predictors in Models 2 and 3. Introducing 
the training conditions in Model 2 did not unveil any training 
effects on reading comprehension. Neither the SRL + TEXT 
condition nor the TEXT condition had a positive effect on stu- 
dents’ reading comprehension, and that is true for both the posttest 
and the follow-up test. The introduction of the intervention variables 
explained only a very small portion of the Level-2 variance 
in Model 1 (3.64% for the posttest and 0.55% for the follow-up 
test). However, when we added the proportion of MB students as 
class-level predictor, the effectiveness of the combined training 
program emerged. In that case, students in the SRL + TEXT 
condition scored significantly higher for reading comprehension 
than students in the other two conditions at posttest. At follow-up, 
the effect remained visible as a trend (p = .06). In Model 3, some 
of the Level-2 variance in Model 1 is explained (21.27% for the 
posttest, 15.82% for the follow-up). 


Process Training Effects 


The variance decompositions into within- and between-classroom 
components in Model 0 showed significant variation among chil- 
dren within classrooms and significant variation between class- 
rooms both for initial status and for the weekly growth rate. For 
initial status, 16.34% of the variance was between classrooms, and 
for the weekly growth rate, 29.81%. The fact that classrooms 
differed more over the course of time than in initial status is not 
surprising: As the classrooms were assigned to different treatment 
conditions, different growth rates were to be expected. 

Table 7 
Results of the Two-Level Analyses for Reading Comprehension 

Variable                    Model 1               Model 2               Model 3
                            B (SE)                B (SE)                B (SE)
Reading at posttest (Time 2)
  Intercept                 0.04 (0.06)           0.02 (0.08)           −0.00 (0.08)
Level 1
  Pretest SRL               −0.01 (0.03)          −0.02 (0.03)          [illegible] (0.03)
  Pretest reading           0.56* (0.37)          0.56* (0.04)          0.56* (0.04)
  Gender female             0.10 (0.06)           0.10 (0.06)           0.10 (0.06)
  Migration                 −0.41* (0.10)         −0.42* (0.10)         [illegible] (0.10)
  Age                       −0.05† (0.03)         −0.06 (0.03)          −0.06 (0.03)
Level 2
  SRL + TEXT                                      0.06 (0.12)           0.19* (0.09)
  TEXT                                            0.03 (0.09)           −0.05 (0.09)
  Migration (agg.)                                                      [illegible] (0.06)
R² Level 1                  42.22%                42.21%                42.30%
R² Level 2                  –                     3.64%                 21.27%
Deviance                    1744.08               1743.70               1736.27
Reading at follow-up (Time 3)
  Intercept                 0.00 (0.06)           0.02 (0.06)           −0.04 (0.07)
Level 1
  Pretest SRL               −0.06* (0.03)         −0.06* (0.03)         [illegible] (0.03)
  Pretest reading           0.57* (0.04)          0.57* (0.04)          0.57* (0.04)
  Gender female             0.14 (0.08)           0.15† (0.08)          [illegible] (0.08)
  Migration                 [illegible] (0.09)    [illegible] (0.10)    [illegible] (0.11)
  Age                       −0.05† (0.03)         −0.05† (0.10)         −0.05† (0.03)
Level 2
  SRL + TEXT                                      0.04 (0.08)           [illegible] (0.07)
  TEXT                                            0.03 (0.08)           −0.04 (0.10)
  Migration (agg.)                                                      [illegible] (0.04)
R² Level 1                  40.37%                40.37%                40.51%
R² Level 2                  –                     0.55%                 15.82%
Deviance                    1781.08               1780.81               [illegible]

Note. N = 763 students from 33 classrooms. Values for Level-1 variables are set in italics if slopes varied freely between classrooms. SRL = 
self-regulated learning; TEXT = text reduction strategies; agg. = aggregated (the proportion of migration background students per classroom was 
aggregated from individual student data on migration background status). Variance-explained statistics were computed from the variance components with 
the following equations: R²(Level 1) = (σ²[unconditional model] − σ²[fitted model]) / σ²[unconditional model]; R²(Level 2) = (τ00[Model 1] − τ00[Model with 
Level-2 variables]) / τ00[Model 1]. 
†p < .10. *p < .05. 
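
The between-classroom percentages from Model 0 are shares of a total variance. A minimal Python sketch of that decomposition follows; the function name and the variance components are hypothetical, and it assumes the share is taken over the student-level and classroom-level components of the same parameter (initial status or growth rate).

```python
def between_share(var_between, var_within):
    """Share of the total variance in a parameter that lies between classrooms."""
    total = var_between + var_within
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_between / total


# Hypothetical variance components for initial status, not values from the study.
var_classrooms = 0.25  # between classrooms (Level 3)
var_students = 1.25    # between students within classrooms (Level 2)
share_initial_status = between_share(var_classrooms, var_students)
```

With these illustrative components, about one sixth of the variance in initial status would lie between classrooms.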

The results of the growth model analysis estimating the increase 
of correctly identified main ideas in both training conditions are 
displayed in Table 8. In Model 1, only individual student charac- 
teristics were included. Reading pretest scores and gender posi- 
tively predicted initial status, meaning that students with higher 
reading test scores as well as girls identified more main ideas 
correctly in the first week of the training. None of the individual 
covariates significantly predicted the linear trend in the course of 
the training, although there was a very small trend showing that the 
number of correctly identified main ideas increased less in the 
course of the training for older students. Introducing classroom- 
level variables into the model in Models 2 and 3 did not appre- 
ciably change the values of the individual predictors. Model 2 
shows that the number of correctly identified main ideas in the first 
week was not predicted by treatment condition, indicating no 
significant differences between the SRL + TEXT and the TEXT 
group at the start of training. For the slope, we found a small effect 
for the SRL + TEXT condition: For students in this group, the 
number of correctly identified main ideas increased by roughly one 
third (0.10 + 0.21 = 0.31) of a main idea per week; in the TEXT 
condition, on the other hand, the number of correctly identified 
main ideas increased by only one tenth (0.10) of a main idea per 
week. The model estimated that by Week 5, students in the SRL + 
TEXT condition identified an average of 1.24 main ideas more 
than in Week 1, whereas students in the TEXT condition identi- 
fied, on average, only 0.40 main ideas more than in their first week 
of training. These results remained stable when we controlled for 
the proportion of MB students per classroom in Model 3. The 
training condition remained the sole significant predictor of the 
growth rate in the course of the training and explained almost 50% 
of the between-classroom variance in the slope. 
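
The predicted gains follow directly from the time coding (0 for Week 1 through 4 for Week 5) and the slope estimates quoted in the text (0.10 for the TEXT group, plus 0.21 for the SRL + TEXT group). The Python sketch below reproduces that arithmetic; the function and variable names are illustrative only.

```python
WEEK_CODES = [0, 1, 2, 3, 4]  # Week 1 to Week 5, as coded in the growth model

BASE_SLOPE = 0.10      # weekly growth in the TEXT (reference) condition
SRL_TEXT_EXTRA = 0.21  # additional weekly growth in the SRL + TEXT condition


def weekly_slope(condition):
    """Weekly growth rate implied by the quoted slope estimates."""
    return BASE_SLOPE + (SRL_TEXT_EXTRA if condition == "SRL + TEXT" else 0.0)


def predicted_gain(condition):
    """Predicted change in main ideas from Week 1 to Week 5."""
    return weekly_slope(condition) * (WEEK_CODES[-1] - WEEK_CODES[0])


gain_srl_text = predicted_gain("SRL + TEXT")  # four weekly steps of about 0.31
gain_text = predicted_gain("TEXT")            # four weekly steps of 0.10
```

Because Week 5 is coded 4, each gain is four times the weekly slope, which gives roughly 1.24 and 0.40 main ideas and a difference of about 0.84 between the two conditions.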


Table 8 
Results of the Three-Level Growth Analysis for Correctly Identifying Main Ideas 

Variable                    Model 1               Model 2               Model 3
                            b (SE)                b (SE)                b (SE)
Intercept (initial status)  5.78 (0.20)           2.76 (0.20)           [illegible] (0.26)
Level 2 (student)
  Pretest SRL               0.08 (0.07)           0.08 (0.07)           0.08 (0.07)
  Pretest reading           0.41* (0.08)          0.41* (0.08)          0.41* (0.08)
  Gender female             [illegible] (0.15)    0.50* (0.15)          0.50* (0.15)
  Migration                 −0.22 (0.19)          −0.23 (0.20)          [illegible] (0.21)
  Age                       0.06 (0.05)           0.06 (0.05)           0.06 (0.05)
Level 3 (classroom)
  SRL + TEXT                                      0.03 (0.33)           −0.04 (0.25)
  Migration (agg.)                                                      0.05 (0.22)
Slope (growth rate)         0.18 (0.04)           0.10 (0.04)           0.09 (0.05)
  Pretest SRL               0.02 (0.02)           0.02 (0.02)           [third pair missing in source; model assignment uncertain]
  Pretest reading           0.00 (0.03)           0.01 (0.02)           [third pair missing in source; model assignment uncertain]
  Gender female             0.01 (0.04)           0.01 (0.04)           0.01 (0.04)
  Migration                 0.03 (0.04)           0.00 (0.04)           0.00 (0.04)
  Age                       −0.02* (0.01)         −0.03* (0.01)         −0.03* (0.01)
Level 3 (classroom)
  SRL + TEXT                                      0.21 (0.06)           0.22 (0.08)
  Migration (agg.)                                                      −0.01 (0.05)
R² Level 2 intercept        19.60%                19.61%                19.61%
R² Level 2 slope            1.64%                 1.24%                 1.26%
R² Level 3 intercept        –                     0.00%                 0.00%
R² Level 3 slope            –                     48.89%                49.55%
Deviance                    8065.17               8050.44               8050.17

Note. N = 2,380 time points from 476 students in 21 classrooms. All Level-2 and Level-3 predictor variables were fixed. SRL = self-regulated learning; TEXT = 
text reduction strategies; agg. = aggregated (the proportion of migration background students per classroom was aggregated from individual student data on 
migration background status). Variance-explained statistics were computed from the variance components with the following equations: R²(Level 2) = (σ²[unconditional 
model] − σ²[fitted model]) / σ²[unconditional model]. R²(Level 3) = (τ[Model 1] − τ[Model with Level-3 variables]) / τ[Model 1]. 


Discussion 


The current study was conducted with two main aims. From a 
theoretical perspective, the purpose was to assess the additional 
benefit of teaching text reduction strategies embedded in a training 
program focused on a normative model of self-regulated learning 
that students systematically study and proceduralize. To this end, 
we compared one group of students who learned text reduction 
strategies while also working on a self-regulated learning training 
routine (SRL + TEXT) with a second group who completed a 
training program focused exclusively on text reduction strategies 
(TEXT). From a more practical perspective, the purpose of this 
study was to examine the benefit of teacher-led text-reduction- 
strategy interventions (SRL + TEXT and TEXT) compared with 
regular classroom instruction (REG). 

Our results generally confirm the effectiveness of the SRL + 
TEXT intervention and the advantage of this combined interven- 
tion over the pure text reduction strategy intervention (TEXT) and 
over regular classroom instruction (REG). In particular, the fol- 
lowing findings apply to the three dependent variables we studied: 
First, as expected, both intervention groups showed linear in- 
creases in the number of main ideas identified in expository texts 
over the course of the respective intervention. We observed greater 
increases among the students in the combined intervention group 
(SRL + TEXT) than among those in the text-reduction-strategy- 
only group (TEXT). During the final week of the intervention, 
children in the combined intervention group identified almost one 
main idea more per expository text than the children in the text- 
strategy-only intervention group. This finding is consistent with 
the results of meta-analyses indicating that children in grade 
school, in particular, do best when they have the chance to work on 
a combination of cognitive and metacognitive strategies (Dignath 
et al., 2008). Other studies with a comparable evaluation design 
(e.g., Stoeger & Ziegler, 2008a) have also revealed continuous 
improvements across an entire 5-week span of training. But unlike 
earlier studies, our current study documented training improve- 
ments that did not decline at the end of the intervention. This result 
suggests that the students’ grasp of the text reduction strategies 
continuously improved and that they were also sufficiently moti- 
vated to continue using these strategies through to the very end of 
the training phase. 

Second, these effects on finding main ideas in texts only transferred 
to higher scores in standardized reading comprehension tests in class- 
rooms with no more than an average proportion of MB students. 
Further analyses comparing only students without a migration back- 
ground revealed treatment effects in the standardized reading com- 
prehension test in the combined intervention condition (SRL + 
TEXT). Non-MB students in the SRL + TEXT group demonstrated 
better reading comprehension at the posttest (Cohen’s d = 0.25) than 
the non-MB students in the group with regular classroom instruction. 
This advantage remained at the follow-up measurement, although it 
became less substantial (Cohen’s d = 0.10). 

Students with MB in this intervention group performed nearly as 
well at mastering the skill of finding main ideas as those without 
MB, but the MB children were largely unsuccessful at applying 
this skill in the new context of reading comprehension in the 
standardized tests. One possible explanatory factor might be vo- 
cabulary deficiencies, especially concerning specific terminology. 
For the daily texts that students in both intervention groups worked 
on, teachers explained difficult words to the children before they 
began working on the texts. This was not the case, however, for the 
standardized tests. It seems plausible that students with MB may 
have had special deficits in their language skills and breadth of 
vocabulary (Baumert & Schümer, 2001; Heinze, Herwartz-Emden, 
& Reiss, 2007) that might have influenced reading comprehension 
(cf. Ouellette, 2006). 

Third, as expected, study participants who worked on text 
reduction strategies in the context of the seven-step cycle of 
self-regulated learning (SRL + TEXT) demonstrated an increased 
preference for self-regulated learning immediately after the train- 
ing. The study participants in both of the comparison groups 
(TEXT and REG), on the other hand, showed no such changes. 
The effect we observed for the combined intervention group 
(SRL + TEXT) increased again from the posttest to the follow-up 
measurement 11 weeks after the training. 

For the combined intervention condition (SRL + TEXT), the 
preference for self-regulated learning increased even more for 
students with MB than for those without. When we compared the 
MB children in this group to the MB children in the regular 
instruction group, we observed a preference-rating increase from 
the first to the second measurement of Cohen’s d = 0.50; for the 
children without MB, the effect only came to Cohen’s d = 0.10. 
Both the MB and non-MB children showed even stronger prefer- 
ences for self-regulated learning at our follow-up 11 weeks after 
the training. In comparison to MB children in the regular instruc- 
tion group, MB children in the combined intervention group 
showed an increase in the strength of their preference for self- 
regulated learning from the first to the third measurement of 
Cohen’s d = 0.64; for non-MB children, the same comparison 
yielded a value of Cohen’s d = 0.30. 
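
For readers less familiar with the effect size metric, the following Python sketch computes Cohen's d as a mean difference divided by the pooled standard deviation. This is the textbook two-group formula and is shown as an assumption; the text does not state exactly how the reported d values for the group comparisons were derived. The function name and the example data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical change scores for two small groups, not data from the study.
intervention_changes = [2, 4, 6]
comparison_changes = [1, 3, 5]
effect = cohens_d(intervention_changes, comparison_changes)
```

In this toy example the groups differ by one point and share a standard deviation of two points, so the effect size is 0.5.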


General Conclusion 


Taken together, these findings add to our understanding of how to 
increase older elementary students’ preference for self-regulated 
learning and how to help them improve their ability to find main 
ideas within an ecologically valid learning setting (De Corte, 2000). 
A comparison of the 
combined intervention group (SRL + TEXT) with the text- 
strategy-only intervention group (TEXT) and with the group re- 
ceiving regular instruction (REG) shows that practicing text re- 
duction skills within the framework of a normative model of 
self-regulated learning provides an additional benefit for elemen- 
tary school children. 

The positive development of students in the combined interven- 
tion is likely a result of the fact that the intervention adheres to four 
factors that researchers have identified as being beneficial (e.g., 
Dignath & Büttner, 2008; Ramdass & Zimmerman, 2011; Schunk 
& Rice, 1987; Weinstein, Husman, & Dierking, 2000): (a) We 
introduced the students in the combined intervention condition 
(SRL + TEXT) to a normative model of self-regulated learning, 
and they practiced each of the steps described in the model over 
the course of several weeks with concrete content and concrete 
strategies; (b) the intervention took place in more than one setting, 
namely, during regular classroom instruction in basic science and 
reading and during homework; (c) the intervention included vari- 
ous illustrations of the benefit of self-regulated learning; and (d) 
over the course of several weeks, students received systematic 
feedback regarding their learning behavior and the relationship 
between this behavior and their achievements. 

The effect sizes for finding main ideas through the combined 
intervention are comparable to those reported for earlier teacher-led 
interventions, and the effect sizes for preference for self-regulated 
learning are somewhat greater (Dignath & Büttner, 
2008). The effect sizes for text comprehension are, however, 
somewhat smaller than in other previous studies (e.g., Paris, Cross, 
& Lipson, 1984). This difference may reflect the fact that we used 
standardized tests rather than tests designed specifically for our 
study. Researcher-designed tests tend to require less transfer of 
skills from one domain to another (e.g., Kim et al., 2004). Never- 
theless, the obtained training effect sizes are lower than those 
reported for researcher-led training programs in small group settings 
(e.g., Dignath & Büttner, 2008). 


Limitations and Future Directions 


Finally, we mention a number of limitations of our study. A first 
concern is about the assessment of self-regulated learning. With 
our assessment of the number of main ideas participants found 
over the course of the training and with the standardized reading 
tests, we established objective criteria for assessing achievement. 
Due to economic constraints, however, self-regulated learning was 
only assessed with a questionnaire. Thus, we did not measure 
students’ actual behavior but rather their subjective assessments of 
their own behavior (cf. Artelt, 1999, 2000). In the case of the 
students in the combined intervention group in particular, this 
assessment approach can lead to distortions since these students 
may, by learning about self-regulated learning, be more prone to 
providing answers that they perceive as being socially desirable. 
For this reason, students’ learning journals should be systematically 
evaluated in future research (cf. Schmitz, Klug, & Schmidt, 2011). 
Doing so should offer more insight into self-regulatory behavior 
during the training phase and provide some indication of the extent to 
which the self-assessments made in response to the relatively general 
questions in the questionnaire correspond with the journal entries 
(e.g., regarding self-monitoring and strategy adaptation) during the 
intervention. This brings us to a second specific recommendation for 
future work in this area: As learning-journal entries also reflect 
subjective assessments of one’s own behavior, it would be helpful to 
also assess students’ learning behavior using other approaches such as 
a microanalytic assessment method (Cleary, 2011), a think-aloud 
method (Greene, Robertson, & Croker Costa, 2011), or an in-depth 
case study method (Butler, 2011). 

Another limitation is the possible occurrence of a Hawthorne 
effect, in that teachers changed their teaching behavior because 
they knew that their classrooms were being studied, not because of 
the specific intervention they received. However, the occurrence of 
a Hawthorne effect would not have been confined to the intervention 
groups because teachers in the REG condition also knew that 
their classrooms were being studied, such that they could have 
tried to improve their regular instruction in their own ways, thus 
making it harder to see intervention effects. 

A final concern lies in the fact that we did not explicitly monitor 
the instruction that the participating teachers carried out during the 
intervention. Teachers did fill out daily checklists on the materials 
they used and on the aspects of the intervention which they dealt 
with, and the results suggest that the interventions fulfill criteria of 
high treatment integrity. However, we do not have systematic data 
about how much time teachers spent working on each topic, which 
methods they preferred (e.g., group work or direct instruction), or 
how didactically effective their instructional approach was. We 
also did not collect data on the teachers’ attitudes about self- 
regulated learning or about the intervention. In future research in 
this area, investigators may be able to incorporate the use of 
trained observers and/or video recording. In addition, asking teach- 
ers about their attitudes toward self-regulated learning, the inter- 
vention they are involved in, and its actual execution as well as 
testing their knowledge of self-regulated learning should provide 
important information about the conditions under which an inter- 
vention can be most effective. 

In summary, the results of this study as well as those of other 
studies offer reason to be optimistic that self-regulated learning 
can be successfully introduced and practiced during classroom 
instruction and homework (cf. Ramdass & Zimmerman, 2011; 
Stoeger & Ziegler, 2011). The transfer of newly acquired self- 
regulated-learning knowledge and its proceduralization into skills 
is best facilitated by a combination of various intervention mod- 
ules that employ various contents (e.g., mathematics, expository 
texts, vocabulary lists) and strategies (e.g., time management, text 
strategies, rehearsal strategies) within the framework of a norma- 
tive model. In the future, researchers will need to (a) examine the 
efficacy of individual intervention modules, (b) better understand 
the conditions under which these modules are effective, and (c) 
look for evidence of both the advantages and the concrete effect of 
sequentially introducing and practicing the individual modules. 


Can Babies Learn to Read? A Randomized Trial of Baby Media 


A growing number of baby media products, targeted to children as young as 3 months old, 
claim to teach babies to read. This randomized controlled trial was designed to examine this claim by 
investigating the effects of a best-selling baby media product on reading development. One hundred and 
seventeen infants, ages 9 to 18 months, were randomly assigned to treatment and control groups. Children 
in the treatment condition received the baby media product, which included DVDs, word and picture 
flashcards, and word books to be used daily over a 7-month period; children in the control condition received 
business as usual. Drawing on a 4-phase developmental model of reading, we examined both precursor 
skills (such as letter name, letter sound knowledge, print awareness, and decoding) and conventional 
reading (vocabulary and comprehension) using a series of eye-tracking tasks and standardized measures. 
Results indicated that babies did not learn to read using baby media, despite some parents displaying 
great confidence in the program’s effectiveness. 

There has been an explosion of new media targeted to babies in 
recent years (Rideout & Hammel, 2006). Ignited by the 1995 
release of Brainy Baby, followed by a deluge of other videos and 
DVDs such as Baby Einstein, educational media specially mar- 
keted to families of infants and toddlers has become big business, 
with the Baby Einstein brand alone selling over $200 million 
worth of products (Robb, Richert, & Wartella, 2009). A substantial 
proportion of these new media claim to promote infants’ cognitive 
development and vocabulary and feature testimonials and adver- 
tisements about how young children may benefit from these com- 
mercial products. Garrison and Christakis (2005), for example, 
reported that of the top 100 best-selling baby DVDs on Amazon.com, 
76 claim to produce specific developmental benefits. According 
to recent reports, these claims appear to be reaching an 
increasingly receptive audience: The average 6-month-old is said 
to own at least four DVDs (Barr, Danziger, Hilliard, Andolina, & 
Ruskis, 2009), with this figure almost doubling by the time infants 
are 18 months old (Linebarger & Vaala, 2010). 

A small proportion of the makers of these baby media, however, 
go beyond promoting developmental benefits to argue for their 
product’s ability to teach babies to read. Claiming that “all babies 
are Einsteins,” the manufacturers of Intellectual Baby, for exam- 
ple, make the case that babies can learn to read beginning at age 3 
months (Intellectual Baby, 2009). Similarly, recognizing that ba- 
bies are “linguistic geniuses,” the makers of Brill Baby (“kids are 
brilliant”; BrillKids, 2011) promote using their Little Readers and 
Little Musicians series starting at infancy. Other products, like 
Your Baby Can Read (www.ybcr.com; Titzer, 2010), claim that 
toddlers will be able to read Charlotte’s Web and Harry Potter if 
regularly exposed to their program as early as age 3 months. 
“Teaching your baby to read is easy,” one product claims (Doman 
& Doman, 2010). “It depends on the brain’s ability to integrate its 
visual, auditory, linguistic and conceptual centers. And that speed 
depends a great deal on the myelination of the neuron’s axons . . . 
the more myelin sheathes the axon, the faster the neuron can 
conduct its charge. In short, the earlier we can teach babies to read, 
the better” (Intellectual Baby, 2009). 

Although skeptics abound (American Academy of Pediatrics, 
1999), a small number of empirical studies have begun to examine 
this assumption, targeting their focus specifically on word learn- 
ing. Vandewater (2011), for example, in a randomized experiment, 
reported that infants exposed to Baby Wordsworth for 1 month 
showed greater gains in receptive vocabulary at a 3-month 
follow-up compared with controls, although their assessment of 
gains relied on parent report rather than direct assessments with 
children. In contrast, Robb et al. (2009) found no greater gains in 
receptive and expressive vocabulary for 12- and 15-month-olds 
who viewed the same product for 6 weeks versus a nonviewing 
control group. Similarly, DeLoache and her research team (DeLoache 
et al., 2010) reported no difference in vocabulary gains for 
12- to 18-month-old children who viewed the same product several 
times a week over a month’s time versus a control group. 

Nevertheless, there are limitations to these studies that might be 
contested by program developers. For one, many of these products 
recommend a longer duration of treatment than multiple exposures 
over 1 month’s time. Your Baby Can Read, for example, suggests
that parents are likely to experience the benefits for their child only 
after 3-7 months of daily use (Titzer, 2010). Second, many top- 
selling products include more than DVDs; for example, Intellec- 
tual Baby includes flashcards and books that accompany the pro- 
gram’s DVD. And third, to date, studies have only measured gains in 
receptive and expressive vocabulary development. Reading profi- 
ciency, as we later define, involves more than oral word learning. 

Therefore, one could make a case that claims about babies’ 
abilities to learn to read have not been subject to rigorous empirical 
testing. However, some might argue that the question is moot, 
given the reported difficulty that infants and toddlers have learning 
from screen media (Anderson & Pempek, 2005). For example, one 
reason young children may struggle to learn from video presenta- 
tions is that they do not display dual representation, or the under- 
standing that pictures and videos depict not only objects in and of 
themselves but also represent similar objects in their world (De- 
Loache, 2004). Development of dual representation helps support 
children’s generalization and transfer of content from screen pre- 
sentations to other situations. Linebarger and Vaala (2010) argued 
that learning and extending new vocabulary, in particular, may be 
especially dependent on dual representation, as this skill requires 
infants to simultaneously represent an object on screen both as its 
own entity and as a symbol for similar objects in other environ- 
ments. 

A substantial amount of research suggests that infants and 
toddlers may not learn information as readily from screen media as 
from a live situation, a phenomenon known as the video deficit 
(Anderson & Pempek, 2005). The deficit has been found using a 
variety of language outcomes, including phonetic learning (Kuhl, 
Tsao, & Liu, 2003), connecting actions with a novel word (Rose- 
berry, Hirsh-Pasek, Parish-Morris, & Golinkoff, 2009), and iden- 
tifying object labels (Krcmar, Grela, & Lin, 2007; Troseth, 
Strouse, Verdine, & Saylor, 2013). Nevertheless, despite young 
children’s inefficient learning from video, studies that have com- 
pared learning from video instruction with no instruction have 
suggested that infants and toddlers are capable of learning infor- 
mation from video (e.g., Barr & Hayne, 1999; Strouse & Troseth, 
2008), especially with supportive scaffolds such as co-viewing and 
repeated viewings (Barr, Muentener, Garcia, Fujimoto, & Chavez, 
2007; Strouse & Troseth, 2013). In addition, Rice (1983) identified 
a number of supportive features of videos themselves including the 
use of predictable program formats, recasts, simple sentences, and 
slow rates of speech, some of which are prevalent in baby media 
products. 

Despite varied research findings, many parents retain the belief 
that their children benefit from watching television. For example, 
a recent survey indicated that 40% of mothers of young children 
believe that their infants and toddlers are learning from screen time 
(Rideout, 2007). Therefore, it is conceivable that certain precursors 
of reading (e.g., phonological awareness) are developing but may 
not be apparent due to limitations in the methods traditionally used
to assess early learning. Consequently, this article was designed to
take marketers at their word and to measure the effects of baby 
media using a more comprehensive model of reading. Conducting 
a year-long randomized controlled trial in which we compared 
families who used a baby media product with relatively high 
fidelity with control families, we examined the effects of baby 
media on babies’ reading development. 


What Is Reading? 


Critical to a study of reading development is the very definition 
of “what is reading?” We use the widely accepted definition of 
reading reported in government consensus documents as well as in 
a convergence of studies (National Early Literacy Panel, 2008; 
National Reading Panel, 2000). Reading is a complex cognitive 
process of decoding symbols for the intention of deriving meaning 
from print. Although it has its detractors, the “simple view” most 
succinctly characterizes the process (Gough & Tunmer, 1986): 
Reading with understanding is the product of decoding and
comprehension, or the simple formula, R = D × C. In actuality,
however, this definition is hardly simple because it suggests a 
multiplicative effect: Decoding by itself is not reading; similarly, 
comprehension of words without the ability to unlock words into 
their constituent parts is not reading. Both must work in concert for 
individuals to be able to read with meaning. 
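
As a purely illustrative sketch of the multiplicative relation (not an analysis from this study), the short Python function below treats decoding and comprehension as proportions between 0 and 1. That scaling, the function name, and the example values are assumptions introduced here. It shows why a score of zero on either component yields no reading, however strong the other component is.

```python
def reading_comprehension(decoding: float, comprehension: float) -> float:
    """Simple view of reading, R = D x C, with each component scaled 0 to 1."""
    for name, value in (("decoding", decoding), ("comprehension", comprehension)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return decoding * comprehension


# Made-up values for illustration only (not data from this study).
print(reading_comprehension(0.0, 0.9))  # strong oral comprehension, no decoding
print(reading_comprehension(0.9, 0.0))  # accurate decoding, no comprehension
print(reading_comprehension(0.8, 0.5))  # both components partly developed
```

Under an additive model the first two cases would still earn partial credit; under the product they do not, which is the point of the formula.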

This definition contrasts with the view that identifying words in
context, such as “McDonald’s” or “Stop” on stop signs (otherwise
known as environmental print), is an indicator of real reading
(Goodman, 1984). However, two independent studies (Masonheimer,
Drum, & Ehri, 1984; Stahl & Murray, 1993) have shown that while
young children as early as age 2½
years could readily and accurately identify many logos, they could 
not read the embedded words when they were removed from the 
logos. Although these behaviors, sometimes described as pseudo- 
reading or pretend reading (Teale & Sulzby, 1989), may highlight 
children’s interest or awareness of symbol systems in their envi- 
ronment, they do not constitute reading or the ability to read words 
accurately in isolation or in text with meaning. 

Nevertheless, the development of reading ability is not an all- 
or-nothing phenomenon. Rather, researchers agree that there are 
developmental phases that emerge or change based on internal 
causes, such as developing cognitive or linguistic capabilities, and 
on external environmental conditions, such as the scaffolding 
kinds of adult activities that support its development. In fact, there 
is substantial agreement among theories of reading development 
(Chall, 1983; Ehri, 1979; Mason, 1980) that children move from 
contextual dependency to early decoding, where they become 
“glued to print” by processing letters and sounds in an effortful 
manner, ultimately to fluent and conventional reading. Such de- 
velopmental theories can provide researchers with a basis for 
assessing development and for predicting what can be expected in 
the developmental path toward reading. 

Consequently, our analysis of whether babies can learn to read 
was based on the premise that there are developmental skills that 
act as precursors to reading proficiency. Using this logic, we 
would presume that even if babies could not read with understanding
as a result of the baby media program (i.e., as indicated by
program developers’ claims), they might at least exhibit some of
the earlier skills that could accelerate later reading development.
Consistent with this thesis, therefore, in this article, we examine
reading developmentally from the emerging to the consolidation 
phases of reading. 


Method 


Participants 


Participants were 117 infants, ages 10-18 months (M = 14.25 
months), and their parents from a small Midwestern city and the 
surrounding county. These families were recruited through a va- 
riety of sources, including flyers distributed in the community 
(e.g., churches, day care centers), displays at community events 
(e.g., public libraries, farmers’ markets), social networking (e.g., 
Facebook), and word of mouth. We used the following inclusion 
criteria for participation in the study: (a) Babies were born full- 
term (i.e., > 37 weeks gestation), (b) heard English as their 
primary language (i.e., < 10% exposure to languages other than 
English), and (c) did not have a history of vision, hearing, or 
cognitive disabilities. Together, our sample included 61 male and 
56 female infants, representing 88% White, 3% African American, 
and 9% bi/multiracial groups. 

Infants were randomly assigned to one of two conditions (treat- 
ment or control) within counterbalanced gender and age (three 
clusters: 10.0-12.9; 13.0-15.9; and 16.0-18.9 months) brackets.
The distribution of infants in each condition did not vary across 
demographic characteristics (all ps > .05). See Table 1 for demo- 
graphic information. 
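
The assignment procedure is described only at this level of detail, so the following Python sketch is one plausible reading rather than the procedure actually used: infants are grouped into gender-by-age strata, shuffled within each stratum, and alternated between conditions. The function names, the tuple layout, and the seed are hypothetical.

```python
import random
from collections import defaultdict


def age_bracket(age_months: float) -> str:
    """Map an age in months to one of the three clusters named in the text."""
    if 10.0 <= age_months < 13.0:
        return "10.0-12.9"
    if 13.0 <= age_months < 16.0:
        return "13.0-15.9"
    if 16.0 <= age_months < 19.0:
        return "16.0-18.9"
    raise ValueError(f"age {age_months} is outside the 10-18 month range")


def assign_conditions(infants, seed=0):
    """Randomly assign infants to conditions within gender-by-age strata.

    `infants` is a list of (infant_id, gender, age_months) tuples.
    Returns a dict mapping infant_id to "treatment" or "control".
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for infant_id, gender, age_months in infants:
        strata[(gender, age_bracket(age_months))].append(infant_id)

    assignment = {}
    for key in sorted(strata):
        ids = strata[key]
        rng.shuffle(ids)
        # Alternate conditions down the shuffled list; a random starting
        # condition keeps odd-sized strata from always favoring one arm.
        first = rng.choice(["treatment", "control"])
        second = "control" if first == "treatment" else "treatment"
        for position, infant_id in enumerate(ids):
            assignment[infant_id] = first if position % 2 == 0 else second
    return assignment


# Hypothetical roster, not the study sample.
roster = [(1, "M", 11.2), (2, "M", 12.0), (3, "F", 14.5), (4, "F", 15.1)]
print(assign_conditions(roster, seed=42))
```

Within any stratum the two conditions differ by at most one infant, which is what keeps gender and age balanced across groups.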


Table 1
Demographic Characteristics of Treatment and Control Children (N = 117)

Characteristic                 Treatment (n = 61)    Control (n = 56)      p [a]
Age (in no. of months)
  Initial visit                14.28 (2.60)          14.21 (2.70)          .887
  Final visit                  22.04 (2.81)          21.46 (2.91)          .294
Gender (%)                                                                 .236
  Male                         57.4                  46.4
  Female                       42.6                  53.6
Ethnicity (%)                                                              .194
  White                        83.6                  92.9
  African-American             4.9                   -
  Bi-/multiracial              11.5                  7.1
Language exposure (%)                                                      .676
  English only                 81.7                  78.6
  Multiple                     18.3                  21.4
Siblings (%)                                                               .849
  Only child                   47.5                  43.6
  1                            39.0                  40.0
  2                            13.6                  16.3
Attends child care             58.3                  44.6                  .140
Parental education (Mdn)       Bachelor’s degree     Bachelor’s degree     .283
Household income (Mdn)         $76,000-$100,000      $51,000-$75,000       .825
Infant/Toddler HOME score      41.69 (2.74)          41.69 (3.29)          [illegible]
Bayley Scales percentile
  Cognitive                    56.36 (23.19)         51.28 (24.43)         .251
  Language                     42.07 (25.70)         40.38 (23.80)         .714
  Social-emotional             48.99 (28.02)         57.70 (28.43)         .100

Note. Standard deviations are presented in parentheses; HOME = Infants and Toddlers Version of the Home
Observation for Measurement of the Environment inventory (Caldwell & Bradley, 2003), maximum score = 45;
Bayley Scales = Bayley Scales of Infant and Toddler Development (3rd ed.; Bayley, 2006).
[a] p reported for t test or chi-square test of treatment versus control groups.


The majority of infants were from middle-class, highly educated
parents. Half of the families earned an annual income larger than
$76,000 and 16% earned less than $30,000. Seventy-eight percent
of mothers and 75% of fathers had at least a bachelor’s degree.
Ninety-two percent of parents were married (single: 6.6%; domestic
partners: 1.1%).

Almost half of the infants were first born, and almost half attended
some form of day care at the start of the study (13% in their own
home, e.g., looked after by a grandparent, and 34% at a day care
center). The quality of infants’ home environments was relatively
high: 11% of families obtained the maximum score (i.e., 45 points) on
the Infant/Toddler Home Observation for Measurement of the
Environment Inventory (Caldwell & Bradley, 2003). Another 83%
of the families scored close to the maximum score (i.e., 39-44
points). Bayley scores in cognition, language, and social-emotional
development were in the average range (Bayley, 2006).


Intervention Materials


The intervention included a best-selling baby media product,
Your Baby Can Read, sold at popular chain stores (e.g., Walmart,
Target) as well as through online vendors. Marketed to children
ages 3 months and older, it purports to teach babies how to read
with fluency and comprehension within 3-6 months of regular
use. The intervention is composed of five volumes or units; each
volume includes a specific number of words to be learned, ranging
from 20 to 27 words. In Volume 1, for example, babies are
expected to learn 22 written words, ranging from two to eight
letters in length. According to the documentation included with the
intervention, words are selected carefully, although no rubric for
selection is provided. Across the different materials, similar words
are presented in different fonts and different colors.

Each volume of the intervention includes a DVD, word cards, 
picture cards, and a picture book. According to the parent guide 
included with the intervention, infants should watch the DVD at 
least one time per day and devote at least 15 min to each of the
other three materials. 

DVDs. Each 20-min DVD volume begins with a brief intro- 
ductory segment in which infants are specifically instructed to 
attend to the text appearing on-screen. DVDs then follow a rou- 
tinized format. Lowercase printed words (e.g., ears) are shown 
centered against a solid colored background, taking approximately 
one third of the visual space. After a 2-s delay, an adult voiceover 
says a familiar carrier phrase (e.g., “Can you say [target word]”)
followed by a child’s voice saying the word that appears on screen. 
After approximately 2 s, a second child voiceover repeats the 
onscreen word. Each time the printed word is spoken, a cursor 
appears beneath the word and travels the length of the word from 
left to right. The printed word then disappears and is replaced with 
a scene depicting the meaning of the word (e.g., a child pointing to 
her ears) accompanied by an adult voiceover describing the scene 
(e.g., “Katie is pointing to her ears”). For each volume, this pattern 
(i.e., printed word followed by scene depicting the word) occurs 
between 60 and 70 times, with each target word presented between
one and five times. Each DVD also includes three brief songs or
nursery rhymes (e.g., “The Itsy Bitsy Spider”).

Word cards. Each volume includes a set of 10-12 flashcards 
measuring 7.5 by 4.5 in. One word is printed on each side of the 
card (i.e., 20-24 words total). The words are printed in all
lowercase letters in black ink on a colored background; each word is
written in a distinct font. 

Picture cards. Each volume includes five picture cards. Pic- 
ture cards measure 7.5 by 4.5 in. and have one word written on 
each side (i.e., 10 words total). Words are printed in lowercase
letters in black ink; font and background color vary from word to 
word. A 1.5-in. tab on the right side can be pulled out to reveal a 
photograph of the referent against a white background. 

Picture books. A 16-page picture book is included with each 
volume of the intervention. Each page includes one target word 
printed on a large flap. The flap can be lifted to reveal a photo- 
graph of the referent. Fonts vary across words; photographs are 
presented in full color on white backgrounds. There is no addi- 
tional text other than the word on the page. 


Research Design 


Our research was designed to examine the four phases of read- 
ing, from pre-alphabetic to consolidated or conventional reading. 
Although there is substantial agreement among theories in the 
phases that distinguish its development, Ehri’s four-phase model is 
the most comprehensive in scope (Ehri, 1994), examining the full 
developmental period of learning to read. In her model, each phase 
of reading development is characterized by the predominant type 
of connection that binds written words to their other identities in 
memory: (a) pre-alphabetic, involving visual and contextual con- 
nections (e.g., using word configurations, or word length without 
any phonological information contributing to the association); (b)
partial alphabetic, making connections between some salient letters
and sounds (e.g., using the sound values of some letters to
form connections between spellings and pronunciation of words); 
(c) full alphabetic, involving complete connections between graph- 
emes and phonemes (e.g., able to form connections between all 
graphemes in spellings and the phonemes in words, securing the 
words in memory); and (d) consolidated alphabetic, when reading 
becomes fluent and automatic (e.g., readers can read words as a 
whole rather than as a sequence of grapheme/phoneme units). 

Based on this model, our research design was to examine 
features of these developmental phases, reasoning that although all 
children in the sample might not become conventional readers 
(despite market claims), they would likely show evidence of ac- 
quiring some of the skills critical to its development (see Figure 1). 

In the present study, therefore, we used multiple methods of 
assessment including parent reports (language development), as 
well as 10 eye-tracking tasks throughout the course of 7 months. 
Eye tracking is a noninvasive methodology permitting high- 
resolution analyses of eye movement patterns. In addition to over- 
all visual preference, eye tracking allows a more precise analysis 
of how infants distribute their attention, such as where infants look 
(i.e., scanning patterns) and how they shift their gaze from one
location to another (i.e., saccade latencies; Gredebäck, Johnson, &
von Hofsten, 2010). Recognizing that young children might ex- 
hibit implicit knowledge prior to explicit demonstrations, these 
gaze-based measures allowed us to tap visual preferences for 
orthographic knowledge (Golinkoff, Ma, Song, & Hirsh-Pasek, 
2013) and other related reading skills that might otherwise not be 
recognized in measures that require overt demonstrations (e.g., 
physical actions or verbal responses). 

Over the course of 7 months, we conducted a home visit, 
arranged for each participant to make four laboratory visits to 
engage in eye-tracking tasks, and performed monthly assessments 
of language development. In addition, we held bi-weekly tele- 
phone conversations with parents to assess fidelity. Table 2 sum- 
marizes these measures, which are described in the following text. 


Baseline Measures 


To ensure that our randomization procedures allowed for equiv- 
alence between conditions before the intervention, we visited 
families in their homes and administered the Bayley Scales of 
Infant and Toddler Development (Bayley, 2006), the HOME In- 
ventory (Caldwell & Bradley, 2003), and a demographic question- 
naire. 

Bayley Scales of Infant and Toddler Development. Infants’ 
general developmental functioning was assessed using the Bayley 
Scales of Infant and Toddler Development III (BSID-III), a de- 
velopmental battery yielding norm-referenced scores (Bayley, 
2006). Three domains were examined during the initial home visit. 
Infants were administered the Cognitive Scale and the Language 
Scale (including both the Receptive and Expressive Communication
subtests) by a trained research assistant, while parents completed
the Social-Emotional Scale portion of the caregiver questionnaire.
The reported reliability for the BSID-III ranges from .87 to .93.

Infant/Toddler HOME inventory. The Infant/Toddler (IT)
version of the Home Observation for Measurement of the Environment
(HOME; Caldwell & Bradley, 2003) was administered at
the initial home visit. The IT HOME uses both observation and
interview to measure the quality and quantity of stimulation and
support available to infants in their home environment in six areas:
(a) Responsivity, (b) Acceptance, (c) Organization, (d) Learning
Materials, (e) Involvement, and (f) Variety. Scores on the subscales
were summed (maximum = 45) to yield an overall HOME
score.

Figure 1. Study outcome measures placed in Ehri’s (1994) four-phase model of reading.

Demographic questionnaire. Parents were asked to complete
a 40-item demographic questionnaire. Items included family
demographic information (e.g., ethnicity, native language),
socioeconomic background (e.g., parent education, family income),
family structure (e.g., marital status, number of siblings), and
educational activities (e.g., infants’ book-reading and baby media
experiences).


Table 2
Developmental Phases of Learning to Read and Their Associated Measures

Developmental phase [a]/brief description             Measure                         Adapted from
Pre-alphabetic
  Expressive vocabulary knowledge                     MacArthur CDI                   Fenson et al. (1994)
  Speed and accuracy of spoken word recognition       Speech processing efficiency    Fernald et al. (2006)
Partial alphabetic
  Receptive vocabulary knowledge                      Vocabulary knowledge            Meints et al. (1999); Swingley & Aslin (2000)
  Phoneme/grapheme correspondence                     Letter-sound knowledge          Letter identification (Woodcock, 1987)
  Knowledge of graphemes                              Letter-name knowledge           Word attack (Woodcock, 1987)
  Identification of written first name                Name recognition                [footnote marker illegible]
  Knowledge of standard book format (e.g.,            Print awareness (orientation)   DeLoache, Uttal, & Pierroutsakos (2000)
    upright orientation)
  Knowledge of the properties of letters and words    Orthographic knowledge          Cassar & Treiman (1997); Kaefer (2009)
Full alphabetic
  Recognition of previously read words                Sight word reading              Ehri (1994)
  Knowledge of words taught in the baby media         Target vocabulary knowledge     [footnote marker illegible]
    program
  Knowledge of standard print format (e.g., left      Directionality of print         Clay (1979)
    to right, top to bottom)
  Translation of text into words                      Decoding                        Word attack (Woodcock, 1987)
Consolidated alphabetic
  Expressive vocabulary knowledge                     MacArthur CDI                   Fenson et al. (1994)
  Comprehension                                       Reading with meaning            [footnote marker illegible]

Note. MacArthur CDI = MacArthur-Bates Communicative Development Inventories.
[a] Adapted from Ehri (1994). [b] Researcher-developed. [c] Adapted from promotional materials for the baby media program.


Exit Questionnaire 


At the end of the study, parents were asked to complete a 
16-item questionnaire indicating their beliefs about their children’s 
early literacy behaviors including their ability to write their name, 
recognize letters, and read. 


Child Assessments 


Children’s first lab visit occurred approximately 2 weeks after 
their initial home visit. Follow-up lab visits then occurred after 
another 3 months, 4 months, and 6 months, or after the treatment 
group finished their initial viewing of Volumes 2, 3, and 5, 
respectively. At these visits, we examined the four phases of 
reading development using a series of eye-tracking tasks. 

Eye-tracking technology. 

Apparatus. Eye movements were measured with a T120 eye 
tracker (Tobii Technology, Falls Church, VA) integrated into a 
17-in. thin film transistor (TFT) monitor (Psychology Software 
Tools, Pittsburgh, PA). This is a remote eye-tracking system that 
had no contact with the infant. The typical spatial accuracy of this 
system is approximately 0.5 visual degrees, and the sampling rate 
is 120 Hz. During tracking, the eye tracker uses infrared diodes to 
generate reflection patterns on the corneas of the infant’s eyes. 
These reflection patterns, together with other visual information 
about the infant, are collected by image sensors and used to 
calculate the three-dimensional position of each eye and gaze point 
on screen. The TFT monitor employs active matrix technology in 
which transistors control each pixel on the screen, improving 
image quality and contrast relative to passive-matrix technologies. 
The monitor has a display resolution of 96 pixels per inch, ensur- 
ing that images are discernible. 

This system uses a binocular tracking method, which allows for 
increased head movements. Head movements typically result in a 
temporary accuracy error of approximately 0.2 visual degrees. In 
the case of particularly fast head movements (i.e., over 25 cm/s), 
there is a 300-ms recovery period to full tracking ability. An 
embedded camera is also used to record infants’ reactions. 

General procedure. Infants sat in a high chair or on their 
parent’s lap approximately 60 cm from the monitor. Parents wore 
headphones and blinders to prevent any interference with their 
infant’s looking behavior. Stimuli were displayed on the Tobii 
monitor and a second monitor facing the experimenter. Tobii 
Studio Professional 3.0 software was used for stimuli presentation 
and data processing. 
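
To give a sense of scale, the stated accuracy of 0.5 visual degrees can be converted to screen pixels using the 60-cm viewing distance and the 96 pixels-per-inch resolution reported above. The helper below is an illustrative calculation only; its name and the assumption that the angle is centered on the line of sight are introduced here for illustration.

```python
import math

CM_PER_INCH = 2.54


def degrees_to_pixels(angle_deg: float, distance_cm: float, pixels_per_inch: float) -> float:
    """On-screen extent, in pixels, subtended by a visual angle centered on the line of sight."""
    size_cm = 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)
    return size_cm / CM_PER_INCH * pixels_per_inch


# Values taken from the text: 0.5 degrees, 60 cm, 96 pixels per inch.
accuracy_px = degrees_to_pixels(0.5, 60.0, 96.0)
print(round(accuracy_px, 1))
```

Under these assumptions the accuracy corresponds to roughly 20 pixels on screen.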

To calibrate gaze, an attention grabber was shown at five points 
on the screen. A manual calibration procedure was used; accuracy 
was checked by Tobii Studio software and repeated as necessary. 
Following calibration, a 2-s attention grabber appeared in the 
center of the screen prior to the beginning of each eye-tracking task.
During the task, the experimenter monitored infants’ eye move- 
ments and behaviors using the live viewer. If infants became 
distracted during the video, the experimenter made a noise (e.g., 
snapping or shaking keys) behind the monitor to re-orient their
attention toward the video. Total duration of each eye-tracking task
was approximately 5 min. 

Each visit took approximately 45-60 min, including both fa- 
miliarization and testing. In the case when we used both interactive 
measures, such as the name recognition task and an eye-tracking 
task, infants would be given a break between tasks. When there 
were two eye-tracking tasks in a single visit, tasks would be 
presented consecutively (i.e., no breaks; total testing time approx- 
imately 10 min). However, breaks were always given if infants 
were fussy or if parents requested a break. 

Measures were presented in a set order across children. Within 
each task, trial order was randomized across children (except for 
name recognition, which was counterbalanced). If a child was
noncompliant on a task, the case would be eliminated for that 
particular measure (approximately 1% of the time). In the case 
when a child was entirely noncompliant, the visit would be re- 
scheduled. This only occurred twice (with the same family) over 
the course of the study. 

Children received a small gift, such as a book or toy, at the end 
of each visit. Parents were compensated $100 for participation in 
the study, $50 at the beginning and $50 at its completion. In 
addition, parents received travel expenses (i.e., $0.56 per mile and
parking) and were allowed to keep the baby media product at the 
conclusion of the study. 

General data processing. Eye movement data were extracted 
using Tobii Studio Professional 3.0 software. Fixations were de- 
fined as any gaze coordinates lasting at least 60 ms and were 
identified using the Tobii Studio fixation filter. Adjacent gazes 
(i.e., gazes within a 0.5° radius, lasting less than 75 ms) were
merged into a single fixation. 
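
The two rules above (a 60-ms minimum duration and merging of adjacent gazes) can be sketched in a few lines of Python. This is a simplified illustration and not the Tobii Studio fixation filter itself, whose internals are not described here. In particular, the sketch assumes that the 75-ms value is the largest time gap bridged when merging, that positions are already expressed in visual degrees, and that merging precedes the duration check; the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List

MIN_FIXATION_MS = 60.0   # shortest event kept as a fixation (from the text)
MERGE_RADIUS_DEG = 0.5   # spatial merge criterion (from the text)
MERGE_GAP_MS = 75.0      # ASSUMPTION: 75 ms read as the largest gap bridged


@dataclass
class Gaze:
    start_ms: float
    end_ms: float
    x_deg: float  # horizontal position in visual degrees
    y_deg: float  # vertical position in visual degrees

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def merge_pair(a: Gaze, b: Gaze) -> Gaze:
    """Combine two gaze events; position is the duration-weighted mean."""
    total = a.duration_ms + b.duration_ms
    weight_a = a.duration_ms / total if total > 0 else 0.5
    weight_b = 1.0 - weight_a
    return Gaze(
        a.start_ms,
        max(a.end_ms, b.end_ms),
        weight_a * a.x_deg + weight_b * b.x_deg,
        weight_a * a.y_deg + weight_b * b.y_deg,
    )


def clean_fixations(gazes: List[Gaze]) -> List[Gaze]:
    """Merge adjacent gaze events, then drop events shorter than the minimum."""
    merged: List[Gaze] = []
    for gaze in sorted(gazes, key=lambda g: g.start_ms):
        if merged:
            last = merged[-1]
            gap = gaze.start_ms - last.end_ms
            dist = math.hypot(gaze.x_deg - last.x_deg, gaze.y_deg - last.y_deg)
            if gap < MERGE_GAP_MS and dist <= MERGE_RADIUS_DEG:
                merged[-1] = merge_pair(last, gaze)
                continue
        merged.append(gaze)
    return [g for g in merged if g.duration_ms >= MIN_FIXATION_MS]
```

Merging before filtering matters: two 50-ms events that sit close together in space and time survive as one fixation, whereas filtering first would discard both.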

Assessments at the pre-alphabetic phase. 

Expressive vocabulary knowledge. Infants’ expressive vocab- 
ulary knowledge was assessed using the short forms of the 
MacArthur—Bates Communicative Development Inventories (Fen- 
son, Pethick, Renda, & Cox, 2000). Parents of infants ages 16 
months and older were asked to indicate which of the words on the 
Level II form their child produced. Parents of infants younger than 
16 months of age completed the Level I form. To facilitate com- 
parison across forms, we computed percentile scores for each child 
based on his or her age and gender and limited our analysis to 
expressive vocabulary. 

Speech processing efficiency. Previous research has indicated 
that speed of word recognition during infancy is positively predic- 
tive of long-term lexical and grammatical development (Fernald, 
Perfors, & Marchman, 2006). To examine infants’ speech processing
efficiency, we had the infants view 12 pairs of referents (i.e.,
target and foil) on the eye-tracking monitor. Referents were se- 
lected from published lexical norms (Dale & Fenson, 1996) as 
familiar to most infants in the age range of our sample. For each 
trial, a pair of 5 × 5 in. photographs appeared. After a pre-message
baseline of 2,150 ms, a voiceover provided a directive (e.g., “Look 
at the car!”). Photographs then remained on the screen for an 
additional 2,750 ms. This procedure was then repeated for the 
remaining trials. Left-right orientation was counterbalanced and pre- 
sented in a set order; trial order was randomized across infants. 
Rectangular areas of interest (AOIs) were drawn around each referent. 
AOIs were kept the same size (461 × 457 pixels) for all photographs
for consistency across trials. Fixation duration for each AOI during 
the baseline phases and latency to first fixation on the AOIs during the
test phase were exported for analysis. 
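
The latency measure can be illustrated with a minimal Python sketch. It works on raw gaze samples rather than filtered fixations, takes the end of the 2,150-ms baseline as the start of the test phase, and uses the 461 × 457 pixel AOI size from the text; the class name, function name, and screen coordinates are hypothetical, and the values actually analyzed were exported from Tobii Studio rather than computed this way.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class AOI:
    """Rectangular area of interest in screen pixels (origin at top left)."""
    left: float
    top: float
    width: float = 461.0
    height: float = 457.0

    def contains(self, x: float, y: float) -> bool:
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)


def latency_to_first_look(
    samples: Iterable[Tuple[float, float, float]],
    target: AOI,
    test_onset_ms: float = 2150.0,
) -> Optional[float]:
    """Time from test-phase onset to the first gaze sample inside the target AOI.

    `samples` holds (time_ms, x, y) tuples. Returns None if the infant
    never looks at the target during the test phase.
    """
    for time_ms, x, y in sorted(samples):
        if time_ms >= test_onset_ms and target.contains(x, y):
            return time_ms - test_onset_ms
    return None
```

A trial on which the infant never reaches the target returns None rather than a numeric latency, so such trials can be excluded or handled separately instead of being scored as zero.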

Assessments at the partial alphabetic phase. 

Receptive vocabulary knowledge. To assess infants’ receptive 
knowledge of vocabulary introduced in the baby media program, 
we created an eight-item measure modeled after Meints, Plunkett, 
and Harris (1999). For this task, infants viewed pairs of referents 
(i.e., target and foil) on the eye-tracking monitor. All referents 
were featured in the first two volumes of the baby media program. 
For each trial, a pair of 5 × 5 in. photographs appeared. Image
resolution averaged 466 × 520 pixels. To avoid biases due to color
preference, target/nontarget pairs were also broadly matched for 
color (e.g., a shoe and a hat that were both red). 

After a pre-message baseline of 2,150 ms, a voiceover provided 
a directive (e.g., “Look at the ear!”). Photographs remained on the 
screen for an additional 2,750 ms. This procedure was then re- 
peated for the remaining trials. Although target words are often 
presented multiple times in looking-while-listening paradigms 
(e.g., Fernald, Zangl, Portillo, & Marchman, 2008), we opted for 
using a single trial per word to potentially avoid task fatigue. 

Left-right orientation was counterbalanced and presented in a
set order; trial order was randomized across infants. Rectangular 
AOIs were drawn around each referent; sizes were consistent 
(461 × 457 pixels) across trials. Fixation duration for each AOI
during the baseline and test phases was exported for analysis. 
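
One common way to summarize such fixation durations in preferential-looking research is the proportion of looking time directed to the target, compared between the baseline and test phases. The Python sketch below shows that calculation; it is offered as an assumption about how the exported durations could be used, not as the analysis reported in this article, and the function names and example values are hypothetical.

```python
from typing import Optional, Tuple


def proportion_to_target(target_ms: float, foil_ms: float) -> Optional[float]:
    """Share of AOI looking time spent on the target; None if neither was fixated."""
    total = target_ms + foil_ms
    if total <= 0:
        return None
    return target_ms / total


def looking_shift(
    baseline: Tuple[float, float],
    test: Tuple[float, float],
) -> Optional[float]:
    """Test-phase minus baseline proportion of target looking for one trial.

    Each argument is a (target_ms, foil_ms) pair of fixation durations.
    """
    before = proportion_to_target(*baseline)
    after = proportion_to_target(*test)
    if before is None or after is None:
        return None
    return after - before


# Hypothetical durations in milliseconds, not data from the study.
print(looking_shift(baseline=(500.0, 500.0), test=(900.0, 300.0)))
```

A positive shift indicates more looking at the named picture after the directive than before it; subtracting the baseline guards against a picture simply being more attractive to look at.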

Letter-sound knowledge. To assess infants’ understanding of 
grapheme-sound correspondences, we created a six-item preferen- 
tial looking task modeled after the Letter-Word Identification 
subtest of the Woodcock Johnson III (Woodcock, McGrew, & 
Mather, 2001). For each trial, infants viewed a pair of lowercase 
letters. After a 2,150-ms pre-message baseline, a voiceover pre- 
sented a directive (e.g., “Look at the /b/!”). The procedure was 
then repeated for the remaining trials. Left-right orientation was 
counterbalanced and presented in a set order; trial order was 
randomized across infants. Rectangular AOIs were drawn around 
each referent; sizes were consistent (326 × 312 pixels) across
trials. Fixation duration was then exported for each AOI during the 
baseline and test phases. 

Letter-name knowledge. Infants’ knowledge of grapheme-name
correspondences was assessed through a six-item preferential
looking task adapted from the Letter-Word Identification
subtest of the Woodcock Johnson III (Woodcock et al., 2001). For 
each trial, infants viewed a pair of lowercase letters. After a 
2,150-ms pre-message baseline, a voiceover presented a directive 
(e.g., “Look at the t!”). The procedure was then repeated for the 
remaining trials. Left-right orientation was counterbalanced and 
presented in a set order; trial order was randomized across infants. 
After drawing rectangular AOIs of 326 × 312 pixels around the
referents, we exported fixation duration for each AOI during the 
baseline and test phases. 

Name recognition. Personal names are one of the first written 
words recognized by young children (Treiman, Cohen, Mul- 
queeny, Kessler, & Schechtman, 2007). To examine infants’ writ- 
ten name recognition, we created a four-item receptive task. For 
each trial, infants were shown two identical toys (i.e., two green 
cars or two yellow boats). One toy was labeled with the infant’s 
first name in 20-point font against a white background. The other 
toy was labeled with a pseudo-word matched in character length 
with the infant’s name (e.g., Nathan vs. Gombie). Infants were
shown both toys and asked to select the one bearing their name
(e.g., “Get Nathan’s car!”). If infants made a selection, they 
advanced to the next trial. If infants failed to make a selection or 
selected both toys, the trial was repeated. This procedure was 
repeated for a total of two car trials and two boat trials (counter- 
balanced for toy order and left/right placement). Trials were scored 
dichotomously (i.e., correct or incorrect), summed to yield an 
overall score, and converted into a proportion score. 
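
The scoring rule just described (dichotomous trials, summed, then converted to a proportion) can be sketched as follows. This is an illustration only; the function name and the example outcomes are hypothetical and not taken from the study.

```python
def proportion_correct(trial_outcomes):
    """Return the share of trials scored 1 (correct) among all scored trials."""
    if not trial_outcomes:
        raise ValueError("at least one scored trial is required")
    if any(outcome not in (0, 1) for outcome in trial_outcomes):
        raise ValueError("trials must be scored 0 (incorrect) or 1 (correct)")
    return sum(trial_outcomes) / len(trial_outcomes)


# Hypothetical infant: correct on 2 of the 4 name-recognition trials.
example_score = proportion_correct([1, 0, 0, 1])
```

Because each trial offers two toys, a proportion of .50 is what guessing alone would be expected to produce.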

Print awareness (orientation). Research suggests that an un- 
derstanding of the canonical orientation of books and print may be 
one of the earliest-emerging domains of print awareness (DeLo- 
ache, Uttal, & Pierroutsakos, 2000). To examine infants’ under- 
standing of book and print orientation, we created a six-item 
preferential-looking task. For half of the trials, infants viewed pairs 
of book covers (i.e., upright and inverted); for the remaining trials, 
they viewed pairs of pseudo-words (i.e., upright and inverted). For 
each trial, infants were oriented to the screen by an attention 
grabber, and then a pair of images appeared. The trial lasted 10 s; 
there were no oral prompts. This procedure was then repeated for 
the remaining trials. Left-right orientation was counterbalanced 
and presented in a set order; trial order was randomized across 
infants. For book trials, AOIs of 450 × 503 pixels were drawn 
around each cover; for word trials, AOIs of 375 × 152 pixels were 
drawn over each word. Total fixation duration to each AOI was 
then exported. 

Orthographic knowledge. Recent work (Kaefer, 2012) sug- 
gests that young children’s understanding of orthographic conven- 
tions may emerge earlier than previously reported. We examined 
infants’ intuitive orthographic knowledge through a nine-item 
preferential-looking task. In three mirror image trials, we paired a 
pseudo-word that obeyed English orthographic conventions with 
its mirror image. In six illegal character trials, we paired ortho- 
graphically legal pseudo-words (e.g., pobe) with orthographically 
illegal versions (i.e., p#be). For each trial, infants were oriented to 
the screen by an attention grabber, followed by a pair of words. 
Trials lasted 10 s; there were no oral prompts. Left-right orientation 
was counterbalanced and presented in a set order; trial order 
was randomized across infants. Rectangular AOIs were drawn 
over each word (438 × 159 pixels), as well as over the individual 
orthographically illegal characters (106 × 159 pixels). We then 
exported total fixation duration to each AOI. 

Sight word reading. We examined infants’ ability to represent 
familiar sight words in memory (Ehri & Robbins, 1992) through a 
six-item preferential-looking assessment. For each trial, infants 
viewed a pair of lowercase words (i.e., target and foil) that were 
featured in the baby media program. Following a 2,150-ms pre- 
message baseline, infants were presented with a directive (e.g., 
“Look at baby!”). Word pairs remained on screen for an additional 
2,750 ms. The procedure was then repeated for the remaining 
trials. Left-right orientation was counterbalanced and presented in 
a set order; trials were presented in random order across participants. 
Rectangular 435 × 193 pixel AOIs were drawn around each 
word, and fixation duration to each AOI during the baseline and 
test phases was exported. 
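
The looking-time tasks above all yield fixation durations per AOI in a baseline window and a test window, and the results are reported as the proportion of time spent looking to the target. A minimal sketch of that conversion follows. It assumes the denominator is target plus foil fixation time, which the text does not state, and the durations are hypothetical.

```python
def proportion_to_target(target_ms, foil_ms):
    """Share of AOI fixation time on the target; None if neither AOI was fixated."""
    total = target_ms + foil_ms
    if total == 0:
        return None
    return target_ms / total


# Hypothetical fixation durations (ms) for one trial.
# Baseline: the 2,150-ms pre-message window; test: the 2,750 ms after the directive.
baseline_proportion = proportion_to_target(target_ms=900, foil_ms=900)
test_proportion = proportion_to_target(target_ms=1500, foil_ms=1000)
shift = test_proportion - baseline_proportion
```

A baseline proportion near .50 indicates no prior preference for either item, which is why baseline looking can serve as a covariate for the test-phase score.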

Assessments at the full alphabetic phase. 

Target vocabulary knowledge. Previous research has sug- 
gested that young children may acquire little oral vocabulary 
knowledge from viewing infant-directed DVDs. To assess infants’ 
expressive knowledge of words introduced in the baby media 
program, we created a 117-item checklist modeled after the 
MacArthur-Bates Communicative Development Inventories (Fenson 
et al., 2007). The checklist consisted of all single words 
highlighted in at least one volume of the baby media DVD and 
included in the word cards. Parents were instructed to indicate all 
words (including their morphological inflections) currently said by 
their infants. Each checklist was summed to yield an overall score 
and converted into a proportion score. 

Directionality of print. Understanding the directionality of 
text is a key element of young children’s developing concepts of 
print (Justice & Ezell, 2001). In a nine-item task, we assessed 
infants’ understanding of left-to-right directionality of words (six 
trials) and top-to-bottom directionality of text (three trials). For 
word trials, infants viewed a single 27- to 30-character word (i.e., 
directionality) or string of symbols (i.e., no directionality). If 
infants understood the directionality of text, we would expect them 
to demonstrate a preference for the AOI for real words and no 
preference for any of the AOIs for meaningless strings of symbols. 
For text trials, they viewed several simple sentences taken from 
commercially available children’s books. For each trial, infants 
were first oriented to the screen by an attention grabber; the trial 
then lasted 10 s. There were no oral prompts. Trial order was 
randomized across infants. For word trials, the width of the monitor 
screen was divided into thirds, and a rectangular AOI (318 × 
151 pixels) was drawn across each third of the text. For sentence 
trials, the screen was divided into quadrants, and AOIs (481 × 360 
pixels) were drawn to cover each quadrant. Tobii Studio Profes- 
sional 3.0 software was then used to export the location of first 
look for each trial. 
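
To make the first-look measure concrete, the sketch below shows how a first look could be assigned to one of three side-by-side AOIs. It is only a stand-in for the export performed by the eye-tracking software: the `AOI` class, the pixel coordinates, and the gaze samples are all hypothetical, and only the 318 × 151 pixel AOI size comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AOI:
    name: str
    left: int
    top: int
    width: int
    height: int

    def contains(self, x, y):
        """True if the gaze point (x, y) falls inside this rectangle."""
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def first_look(gaze_samples, aois):
    """Return the name of the first AOI hit by a gaze sample, or None."""
    for x, y in gaze_samples:
        for aoi in aois:
            if aoi.contains(x, y):
                return aoi.name
    return None


# Hypothetical layout: three 318 x 151 pixel AOIs placed side by side.
word_aois = [AOI(name, 10 + i * 318, 300, 318, 151)
             for i, name in enumerate(("left", "middle", "right"))]

# Hypothetical gaze samples in time order; the first one misses every AOI.
samples = [(5, 5), (700, 350), (50, 350)]
first = first_look(samples, word_aois)
```

In this example the first sample that lands in any AOI falls in the right-hand third, so the trial would not count as a left first look.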

Decoding. Conventional literacy is frequently characterized as 
the product of code-based skills and comprehension (Gough & 
Tunmer, 1986). To investigate infants’ decoding abilities, we 
constructed a six-item assessment modeled after the Word Attack 
subtest of the Woodcock Johnson III (Woodcock et al., 2001). For 
each trial, infants viewed a pair of words that were not included in 
the baby media program (i.e., target and foil). Following a 
2,150-ms pre-message baseline, infants were presented with a 
directive (e.g., “Look at cheese!”). The word pair remained on 
screen for an additional 2,750 ms. This procedure was then re- 
peated for the remaining trials. Left-right orientation was coun- 
terbalanced and presented in a set order; trials were presented in 
random order across participants. Rectangular 435 × 193 pixel 
AOIs were drawn around each pseudo-word, and fixation duration 
to each AOI during the baseline and test phases was exported. 

Assessments at the consolidated alphabetic phase. 

Expressive vocabulary knowledge. Expressive vocabulary 
knowledge was assessed using the short form of the MacArthur-Bates 
Communicative Development Inventories (Fenson et al., 
2000). At the final visit, all infants were age 16 months or older; 
therefore, all parents were asked to complete the Level II form. 
Percentile scores were calculated based on infants’ age and gender. 

Reading with meaning. To examine infants’ ability to comprehend 
simple written phrases, we created a six-item task modeled 
after the promotional materials for the baby media program. 
All phrases were featured in at least one volume of the program 
DVD and word cards. 

To ensure that infants understood the task, we first administered 
two training trials. The research assistant held up a 5.5 × 8.5 in. 
white card printed with a target phrase in 72-point lowercase text. 


While running her finger left to right under the text, she read the 
depicted phrase aloud (e.g., “It says shake your head!”), orally 
repeated the target action (e.g., “Shake your head!”), and per- 
formed the action. This procedure was repeated until the infant 
also completed the action. 

Following the two training trials, infants completed six test trials 
(administered in a randomized order). For each trial, the research 
assistant held up a card and, while running her finger left to right 
under the text, provided a directive (e.g., “Do this one!”). If infants 
responded, they were given neutral feedback and moved on to the 
next trial. If infants failed to respond, the trial was repeated (up to 
a total of three repetitions). This procedure was then repeated for 
the remaining test trials. Trials were scored dichotomously (i.e., 
correct or incorrect) online. Additionally, 20% of video recordings 
were randomly selected to be independently coded by a second 
research assistant. Interrater agreement was 95.83%. 
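
Percent agreement of the kind reported here is the share of double-coded trials on which both coders gave the same score. The sketch below is illustrative: the number of double-coded trials is not given in the text, so the 24 trials with one disagreement are a hypothetical count that happens to round to the reported figure.

```python
def percent_agreement(coder_a, coder_b):
    """Percentage of trials on which two coders assigned the same score."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must score the same non-empty set of trials")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100 * matches / len(coder_a)


# Hypothetical double-coded set: 24 trials scored 0/1, with one disagreement.
coder_a = [1, 0] * 12
coder_b = list(coder_a)
coder_b[5] = 1 - coder_b[5]
agreement = percent_agreement(coder_a, coder_b)
```

Percent agreement does not correct for chance agreement; a chance-corrected index would require the full cross-tabulation of the two coders' scores.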


Procedure 


Following baseline procedures, two trained research assistants 
visited treatment families in their homes. They introduced the 
intervention procedure, adapted from the directions provided by 
the baby media program, modeled its use, and answered any 
questions. The same researchers visited all families to ensure 
consistency of instruction. 

During the home visit, parents were provided with the first 
volume of intervention materials. The research assistant reviewed 
the instructions, suggesting that parents show their infants the first 
DVD two times per day for 30 days (i.e., 20 hr of exposure). They 
were encouraged to watch the DVD with their babies and point out 
words on the screen whenever possible. Parents were also in- 
structed to engage with their infants while using each of the other 
intervention materials (i.e., word cards, picture cards, and picture 
book) for 15 min per day (i.e., 7.5 hr of exposure per component) 
and to feel free to break the interactions up across the day if they 
found their child was unwilling to attend for the full time. Repe- 
titions of intervention materials were encouraged but not required. 
Finally, parents were provided with tips for supporting reading 
activities every day (e.g., playing matching games, pointing out 
rhyming words). Approximately 30 days after the home visit, 
families were mailed the second volume of intervention materials, 
along with a new set of instructions. Following the program’s 
guidelines, parents were instructed to use the second volume once 
per day for 60 days and to follow the previous protocol’s recom- 
mendation with each of the other materials. Additionally, parents 
were asked to show their children the previous DVD, as well as use 
each of the previous materials, once a week. 

By following this protocol designed by the program developers, 
over the course of the study, infants would receive 70 hr of DVD 
training and 45 hr of interacting with each of the other materials 
(i.e., word cards, picture cards, and picture books). Together, they 
would be exposed to 117 words in multiple formats. 
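
The exposure figures in the protocol can be checked with simple arithmetic. The sketch below uses only numbers stated above; the per-viewing DVD length and the total number of days are inferred from those numbers rather than reported, and the second inference assumes the 15-min daily recommendation held for the whole study.

```python
MIN_PER_HR = 60

# Volume 1 DVD: two viewings per day for 30 days, stated as 20 hr of exposure.
dvd_viewings = 2 * 30
implied_min_per_viewing = 20 * MIN_PER_HR / dvd_viewings  # inferred, not stated

# Each other component in Volume 1: 15 min per day for 30 days.
component_hr = 15 * 30 / MIN_PER_HR

# Whole study: 45 hr per component at 15 min per day implies this many days.
implied_study_days = 45 * MIN_PER_HR / 15
```

Under these assumptions each DVD viewing lasts about 20 min, each companion material accounts for 7.5 hr in the first volume, and the full protocol spans roughly 180 days.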

Fidelity of implementation. We developed a four-item 
checklist to capture the degree to which treatment families adhered 
to the instructions for implementing the baby media program. 
Using the program’s parent guide, we identified key features of the 
program to include on the checklist: (a) infant watched the DVD, 
(b) infant used the word cards, (c) infant used the picture cards, 
and (d) infant read the picture book. 
Over the course of the study, a trained research assistant called 
treatment families on a bi-weekly basis. Parents were asked about 
their infant’s use of the baby media program on the previous day. 
Checklist items were adapted to specifically target the current 
volume of the program. Each feature was scored dichotomously as 
1 (i.e., completed) or 0 (i.e., not completed). Implementation 
varied across the four components. Families were most likely to 
use the DVD (M = 64.38%, SD = 29.56), followed by the picture 
book (M = 57.92%, SD = 28.99), and picture cards (M = 54.48%, 
SD = 27.64). They were least likely to use the word cards (M = 
26.90%, SD = 30.07). 
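
One way to arrive at component-level fidelity percentages like those above is to average each family's 0/1 call reports and then summarize across families. This is a sketch under that assumption; the exact aggregation used in the study is not described, and the family labels and reports are hypothetical.

```python
from statistics import mean, stdev


def component_fidelity(call_reports):
    """Percentage of check-in calls on which a component was reported as used."""
    return 100 * sum(call_reports) / len(call_reports)


# Hypothetical 0/1 DVD reports across four calls for three families.
dvd_reports = {
    "family_a": [1, 1, 1, 0],
    "family_b": [1, 0, 0, 0],
    "family_c": [1, 1, 0, 0],
}
dvd_fidelity = [component_fidelity(reports) for reports in dvd_reports.values()]
dvd_mean = mean(dvd_fidelity)
dvd_sd = stdev(dvd_fidelity)  # sample standard deviation across families
```

The large standard deviations reported in the text indicate that families differed widely in how often they used each component.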

To examine whether enactment declined over the course of the 
study, we compared each family’s implementation checklists 
across the five volumes. Fidelity to the full baby media program 
(all elements together), as well as implementation of the picture 
book, picture cards, and word cards, remained consistent across 
volumes (all ps > .05). However, parents reported significantly 
lower enactment of the final DVD compared with prior volumes 
(p = .014). 


Results 


In this section, we address the effects of the intervention on 
babies’ reading development. We begin by conducting descriptive 
analyses to examine the distributional properties of the data and to 
determine the equivalency of the treatment and control groups 
prior to further analysis. Subsequently, we use inferential statistics 
to examine the effects of the program. To test whether infants 
who used the components of the program more frequently were 
more likely to display early literacy skills than children who 

used the program less frequently, we correlated fidelity with 
each of the outcome measures. There was no evidence that 
fidelity to the program supported any of the literacy skills (see 
Table 3). Further, we conducted all analyses with percentage of 
fidelity as covariate and found it to be nonsignificant for all 
measures. Therefore, we used one-way analyses of covariance 
(ANCOVA) with condition as the independent variable, and base- 
line as covariate when included in the task, to examine each phase 
of development in learning to read. Because of the age range 
among the children and the developmental differences between 
infants at various stages, we included age as a covariate in all 
nonstandardized analyses. Additionally, when available, we used 
pretest or baseline scores as covariates. 
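
The note to Table 3 refers to a Bonferroni-corrected value of .003. The sketch below shows how such a threshold is obtained. It assumes a familywise alpha of .05 divided across the 16 correlations listed in the table; both numbers are inferred from the table rather than stated.

```python
FAMILYWISE_ALPHA = 0.05  # assumed conventional level
N_TESTS = 16             # number of rows listed in Table 3

bonferroni_alpha = FAMILYWISE_ALPHA / N_TESTS


def survives_correction(p_value, threshold=bonferroni_alpha):
    """True if a p value falls below the corrected per-test threshold."""
    return p_value < threshold


# Smallest legible p value in Table 3 (sight word reading row).
smallest_legible_p = 0.045
```

Under these assumptions the threshold is .003125, which rounds to .003, so a correlation with p = .045 would not count as significant after correction.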


Descriptive Analyses 


As shown in Table 1, there were no significant group differences 
in the child demographic data, HOME scores, or Bayley scores at 
the infants’ first visits. 

Treatment and control groups also did not differ initially on their 
previous media experience. Data for our sample are presented 
alongside national averages in Table 4. In general, children in our 
sample had less television exposure than the national average, had 
fewer televisions in their homes, and were less likely to have cable 
access than the average child (Rideout & Hammel, 2006). 

Contrastingly, infants in our sample in both treatment and con- 
trol groups were more likely to have shared-reading experience 
than the national average, with 75% of the parents reporting they 
read daily to their child (see Table 5). Despite having more regular 


Table 3 
Correlation Between Fidelity to the Intervention and 
Child Outcomes 

Outcome measure                              r             p 
Expressive vocabulary knowledge              .259          .056 
Receptive vocabulary knowledge               −.053         .710 
Letter-sound knowledge                       [illegible]   .465 
Letter-name knowledge                        .048          .743 
Name recognition                             .200          .168 
Print awareness–book orientation             .075          .600 
Print awareness–text orientation             [illegible]   .290 
Orthographic knowledge–mirror                .102          .482 
Orthographic knowledge–illegal character     .108          .456 
Sight word reading                           .283          .045 
Target vocabulary knowledge                  [illegible]   .102 
Directionality–words                         .039          .186 
Directionality–sentences                     .170          .233 
Decoding                                     [illegible]   .356 
Expressive vocabulary knowledge              [illegible]   [illegible] 
Reading with meaning                         −.024         .863 

Note. None of the measures had ps less than the Bonferroni-corrected 
value of .003. 


reading sessions than the national average, infants in our sample 
appeared to be read to for slightly less time than the average 
infant, with the most popular response for parents in our sample 
being 15-30 minutes, just below the national average of 33 minutes 
per day. 


Pre-Alphabetic and Partial Alphabetic Phases 


Means and standard deviations for the pre-alphabetic and partial 
alphabetic phases of reading are presented in Table 6. As shown, 
although the means were slightly higher for the treatment group in 
the pre-alphabetic phase, there were no significant differences 
between groups on either of these measures. As evidenced by 
parental report, expressive vocabulary between groups was statis- 
tically equivalent. 

Speech processing efficiency. Similarly, there were no sig- 
nificant differences between groups in children’s ability to process 
speech. If the treatment had facilitated children’s processing of 
oral language, we would have expected children to demonstrate 
significantly lower time to first fixation than children in the control 
group. To measure such processing, we first checked to make sure 
that children did not have a preference for the target or foil picture 
prior to being prompted to look. In our task, seven target words 
passed this first criterion: bottle, bucket, car, dog, ear, giraffe, and 
tiger. Second, to ensure that children’s latency to fixate was based 
on actual processing of the verbal prompt, we needed to establish 
that children knew the words presented in the task. We confirmed 
this by asking parents to indicate which of the words their children 
understood and excluded trials for which we had no parent con- 
firmation. Each child’s latency to look to the target was then 
averaged across each of their eligible trials. There was no differ- 
ence between groups in latency to look at the target object, F(1, 
75) = 1.77, p = .188, ηp² = .023. The age covariate was also 
nonsignificant, F(1, 75) = 1.33, p = .258, ηp² = .017. These results 
suggest that after more than a month’s intervention, there was no 
evidence of significant effects for treatment on children’s pre- 
alphabetic skills in the ability to process speech more efficiently. 
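
The effect sizes reported next to each F test appear consistent with partial eta squared, which can be recovered from F and its degrees of freedom. The sketch below assumes that interpretation; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Convert an F statistic to partial eta squared.

    Equivalent to SS_effect / (SS_effect + SS_error), because
    F = (SS_effect / df_effect) / (SS_error / df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Values reported above for speech processing efficiency.
condition_effect = partial_eta_squared(1.77, 1, 75)
age_effect = partial_eta_squared(1.33, 1, 75)
```

Both results round to the values given in the text (.023 and .017), so condition and age each account for about 2% of the variance not explained by the other term.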
Table 4 
Infants’ Television Exposure 

                                              Treatment     Control                     National 
Questionnaire item                            (n = 61)      (n = 56)      pᵃ            average 
No. of working televisions in the home (%) 
  0                                           3.3           6.0           .343          1%ᵇ 
  1-3                                         90.0          91.0                        75%ᵇ 
  4 or more                                   6.7           5.4                         24%ᵇ 
Home has cable/satellite television           63.3          69.6          .472          80%ᵇ 
Infant has started watching television        56.7          42.9          [illegible]   79%ᶜ 
Infant’s average daily television viewing                                               34 min 
  None                                        64.4          60.7          .446 
  ≤ 1 hr                                      25.4          33.9 
  1-2 hr                                      10.2          5.4 
Parent talks with infant while co-viewing 
  Never                                       14.3          20.5          [illegible]   n/a 
  Once in a while                             16.3          10.3 
  Frequently                                  37            38.5 
  Almost always                               16.3          30.8 
Infant has television in bedroom              [illegible]   [illegible]   [illegible]   19%ᶜ 
Infant has favorite TV program/DVD            41.7          28.6          .140          n/a 
Parent attitudes toward educational mediaᵈ    2.67 (0.43)   2.34 (0.61)   .002** 
  Mostly helpsᵇ                                                                         38% 
  Not much effectᵇ                                                                      22% 
  Mostly hurtsᵇ                                                                         31% 

ᵃ p reported for t test or chi-square test of treatment versus control groups. ᵇ Data reported by Rideout and 
Hammel (2006) for children ages 6 months-6 years. ᶜ Data reported by Rideout and Hammel (2006) for 
children ages 6-23 months. ᵈ Mean value and standard deviation on scale of 1-4; neutral score is 2.5. 
** p < .01. 


At the partial alphabetic phase, we found a similar pattern. 
Given the number of tasks at this phase, we would expect that the 
intervention might influence at least some of the initial speech-to- 
print skills that it presumably emphasizes throughout the program. 
However, as shown in Table 6, there were no apparent patterns of 
improvements in these skills, with the treatment group slightly 
ahead on four of the measures, and the control group slightly ahead 
on another four of the measures. None were statistically significant, 
as described in the following text. 

Receptive vocabulary knowledge. If children learned the 
vocabulary words presented in the baby media program, we would 
expect children in the treatment group to recognize more of the 
target vocabulary words taken from the program than those in the 
control group. A one-way ANCOVA indicated that there was no 
difference between groups in proportion of time looking to the 
target photographs at baseline, F(1, 101) = 0.56, p = .458, ηp² = 
.005. The age covariate was also nonsignificant, F(1, 101) = 0.06, 
p = .805, ηp² = .001. There was also no difference between groups 
in the proportion of time looking to the target photographs after 
they were prompted, F(1, 102) = 1.32, p = .254, ηp² = .013. Age 
and baseline fixations were both nonsignificant covariates (age: 
F(1, 102) = 0.61, p = .439, ηp² = .006; baseline fixations: F(1, 
102) = 3.35, p = .070, ηp² = .033). 

Letter-name and letter-sound knowledge. If exposure to 
the treatment enhanced letter knowledge, we would expect children 
in this condition to demonstrate significantly longer fixations 
to the target letter (compared with the foil letter) than children in 
the control group. One-way ANCOVAs with condition as the 


Table 5 
Infants’ Shared Reading Experience 

                                            Percentage of those in 
                                            Treatment group   Control group                 National 
Questionnaire item                          (n = 61)          (n = 56)        pᵃ            averageᵇ 
Infant has started being read to            98.4              100             .332          94% 
Infant is read to daily                     [illegible]       78.6            [illegible]   58% 
Infant’s average daily reading duration                                                     33 min 
  Not at all                                1.7               0.0             .485 
  A few minutes                             25.0              39.3 
  At least 15 min                           55.0              46.4 
  More than 15 min                          18.3              14.3 
Infant has favorite book                    63.3              58.9            .627          n/a 

ᵃ p reported for chi-square test of treatment versus control groups. ᵇ Rideout and Hammel (2006) for children 
ages 6-23 months. 

Table 6 
Descriptive Statistics for Pre-Alphabetic and Partial Alphabetic Measures 

                                                                      Treatment       Control 
Measure                                                               M (SD)          M (SD)          pᵃ 
Expressive vocabulary knowledge (Percentile score on the 
  MacArthur CDI short form at home visit)                             35.81 (30.17)   31.58 (29.33)   .444 
Speech processing efficiency (Latency to look to target 
  in seconds)                                                         0.99 (0.68)     0.80 (0.65)     .188 
Receptive vocabulary knowledge (Proportion of time spent 
  looking to target) 
    Baseline                                                          [illegible]     [illegible]     .458 
    Test                                                              .59 (.12)       .55 (.14)       .254 
Letter-sound knowledge (Proportion of time spent looking to target) 
    Baseline                                                          .46 (.14)       .49 (.14)       .245 
    Test                                                              .50 (.16)       [illegible]     .968 
Letter-name knowledge (Proportion of time spent looking to target) 
    Baseline                                                          .48 (.15)       [illegible]     .293 
    Test                                                              .48 (.19)       .44 (.22)       .200 
Name recognition (Proportion of trials correct)                       [illegible]     .59 (.21)       .216 
Print awareness (Orientation; proportion of time spent 
  looking to target) 
    Books                                                             .54 (.14)       .56 (.12)       .645 
    Text                                                              .47 (.24)       .47 (.27)       .837 
Orthographic knowledge 
  Proportion of time spent looking to target words 
    Mirror                                                            [illegible]     [illegible] (.28)   .573 
    Illegal character                                                 .38 (.15)       .42 (.17)       .203 
  Looking time to illegal character (in seconds)                      0.68 (0.81)     0.46 (0.38)     .071 
Sight word reading (Proportion of time spent looking to target) 
    Baseline                                                          .49 (.17)       [illegible]     .666 
    Test                                                              [illegible]     .50 (.22)       .413 

Note. MacArthur CDI short form = MacArthur-Bates Communicative Development Inventories-short form (Fenson et al., 2004). 
ᵃ p reported for analysis of covariance comparison of condition. 




independent variable and age as the covariate indicated that there 
was no difference between groups in proportion of time looking to 
the target letter at baseline for letter sounds, F(1, 100) = 1.37, p = 
.245, ηp² = .014, or letter names, F(1, 100) = 1.12, p = .293, ηp² = 
.011. The age covariate was also nonsignificant in both tests (sounds: 
F(1, 100) = 0.02, p = .891, ηp² < .001; names: F(1, 100) = 
0.55, p = .462, ηp² = .006). This assessment demonstrated that 
children had no visual preference for one of the letters over the 
other prior to being prompted where to look. 

One-way ANCOVAs with age and baseline looking as covari- 
ates indicated that there was also no difference between groups in 
proportion of time looking at the target letter at test (after the 
prompt) when the letter sound was prompted, F(1, 99) = 0.002, 
p = .968, ηp² < .001. Both covariates were nonsignificant as 
well (age: F(1, 99) = 0.006, p = .937, ηp² < .001; baseline: F(1, 
99) = 0.63, p = .430, ηp² = .006). Additionally, there was no 
difference between groups in proportion of time looking at the 
target letter when the letter name was prompted, F(1, 95) = 1.67, 
p = .200, ηp² = .015. Age and baseline looking were both significant 
covariates (age: F(1, 95) = 4.14, p = .045, ηp² = .038; 
baseline: F(1, 95) = 8.60, p = .004, ηp² = .078). Both groups’ 
proportion looking to the target was near chance (.50) for letter 
names and sounds. 

Name recognition. To test whether experience with the inter- 
vention improved children’s ability to recognize their written 
names, we analyzed children’s performance on four trials in which 
they were asked to select the toy with their name printed on it. 
Children received a proportion score for the number of trials in 
which they chose the correct object out of four. Both groups scored 
near chance (.50). An ANCOVA with age as the covariate indi- 
cated that there was no difference between the treatment and 


control groups, F(1, 98) = 1.55, p = .216, ηp² = .015. Age was 
also nonsignificant, F(1, 98) = 2.23, p = .139, ηp² = .022. 
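
For readers unfamiliar with the test used throughout this section, the sketch below computes the F statistic for a one-way ANCOVA with a single covariate (such as age) by comparing a covariate-only model with a group-plus-covariate model. It is a textbook formulation that assumes equal slopes across groups, not a reproduction of the study's analysis software, and the data are hypothetical.

```python
def _centered_sums(x, y):
    """Return (Sxx, Sxy, Syy), the sums of squares and products about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxx, sxy, syy


def one_way_ancova(groups):
    """F test for a group effect on y after adjusting for one covariate x.

    groups maps a label to a list of (x, y) pairs.
    Returns (F, df_effect, df_error).
    """
    all_x = [x for pairs in groups.values() for x, _ in pairs]
    all_y = [y for pairs in groups.values() for _, y in pairs]

    # Reduced model (covariate only): residual SS around a single regression line.
    sxx_t, sxy_t, syy_t = _centered_sums(all_x, all_y)
    sse_reduced = syy_t - sxy_t ** 2 / sxx_t

    # Full model (group + covariate): pool the within-group sums.
    sxx_w = sxy_w = syy_w = 0.0
    for pairs in groups.values():
        sxx, sxy, syy = _centered_sums([x for x, _ in pairs], [y for _, y in pairs])
        sxx_w += sxx
        sxy_w += sxy
        syy_w += syy
    sse_full = syy_w - sxy_w ** 2 / sxx_w

    df_effect = len(groups) - 1
    df_error = len(all_y) - len(groups) - 1
    f_value = ((sse_reduced - sse_full) / df_effect) / (sse_full / df_error)
    return f_value, df_effect, df_error


# Hypothetical (covariate, outcome) pairs for two small groups.
example = {
    "treatment": [(1, 2), (2, 4), (3, 3)],
    "control": [(1, 1), (2, 3), (3, 2)],
}
f_value, df_effect, df_error = one_way_ancova(example)
```

With a second covariate, such as baseline looking, the same model comparison applies, but the residual sums of squares come from a multiple regression rather than the single-covariate shortcut used here.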

Print awareness. If experience with the intervention sup- 
ported children’s understanding of print orientation, we would 
expect children in the treatment condition to look longer at upright 
book covers and words than children in the control group. How- 
ever, there was no group difference in proportion of looking to the 
upright book covers (vs. the inverted ones), controlling for age, 
F(1, 95) = 0.21, p = .645, ηp² = .002. Age was a nonsignificant 
covariate, F(1, 95) = 0.045, p = .833, ηp² = .001. There was also 
no difference in proportion of looking to the upright words, controlling 
for age, F(1, 90) = 0.04, p = .837, ηp² = .001. Age was 
again nonsignificant, F(1, 90) = 1.12, p = .294, ηp² = .012. Both 
groups spent approximately half the time looking at the target and 
half the time looking at the foil on both tasks. 

Orthographic knowledge. Similarly, we would expect chil- 
dren in the intervention to begin to recognize what was or was not 
a “word.” We compared children’s proportion looking to two 
different types of standard and nonstandard pseudo-words. First, 
we looked at children’s preference for standard versus mirror- 
image-reversed copies of the same word. An ANCOVA with age 
as the covariate indicated that there was no group difference in 
proportion looking to the standard word over the reversed word, 
F(1, 93) = 0.32, p = .573, ηp² = .003. Age was nonsignificant, 
F(1, 93) = 0.07, p = .794, ηp² = .001. Next, we investigated 
children’s preference for standard words versus those with illegal 
characters inserted (such as # or $). Again, there were no group 
differences, controlling for age, F(1, 94) = 1.64, p = .203, ηp² = 
.017. Age was nonsignificant, F(1, 94) = 0.64, p = .426, ηp² = 
.007. However, we did note that children in the treatment group 
spent marginally more time (in raw seconds) fixated on the individual 
illegal character than children in the control group, controlling 
for age, F(1, 98) = 3.32, p = .071, ηp² = .033, indicating some 
recognition of irregularity. Age was nonsignificant, F(1, 99) = 
0.01, p = .937, ηp² = .001. 

Sight word reading. Finally, if children learned word-text 
mappings from exposure to the intervention, we would expect 
children in the treatment group to demonstrate significantly longer 
fixations to target words than foil words when prompted. An 
ANCOVA with age as the covariate indicated that there was no 
group difference in proportion of looking to the target (vs. the foil) 
prior to prompting, F(1, 100) = 0.19, p = .666, ηp² = .002. Both 
groups spent about half the time looking to each side of the screen. 
These results indicated that children did not prefer to look at one 
picture over the other prior to being prompted where to focus. 

Baseline looking was significant, F(1, 98) = 6.14, p = .015, 
ηp² = .059. However, a one-way ANCOVA with age and baseline 
looking as covariates indicated that there were no group differences 
in proportion of looking to the target after prompting, F(1, 98) = 
0.68, p = .413, ηp² = .007. Age was a nonsignificant covariate, 
F(1, 98) < 0.01, p = .959, ηp² = .001. 

In sum, these results showed no evidence of positive effects of 
the intervention on children’s pre-alphabetic or partial alphabetic 
phases of early literacy development. Using multiple measures 
designed to tap many different aspects of early development, we 
found no discernable significant differences between groups on 
word learning or the skills associated with reading development. 


Full Alphabetic and Consolidated Alphabetic Phases 


Although one might assume that the later phases of reading are 
predicated on improvements in the earlier phases, and not likely to 
show evidence of impact on children’s reading development, we 
believed it was prudent to examine the skills associated with 
conventional reading for several reasons. First, there was a 


Table 7 
reasonable expectation that reading development may not represent a 
process where one skill is prerequisite for movement to the next 
phase (Ehri & Roberts, 2006). None of the developmental theories 
of reading are so rigid. Second, although the instructional design of 
the program is based on an analytic phonics approach, it focuses 
mostly on sight word reading and associative learning with words 
and word meaning connections; and third, claims made by these 
programs argue for conventional and fluent reading. Therefore, in 
the final months, we examined the more conventional skills asso- 
ciated with the simple view (e.g., decoding and comprehension). 
Table 7 presents the means and standard deviations of measures 
that are representative of the transition to the consolidated phase 
and conventional reading. 

Directionality of print. Conceivably, the intervention should 
help children understand concepts of print, particularly the direc- 
tionality of text. Therefore, we would expect children in the 
treatment group to direct their gaze toward the beginning of words 
and sentences significantly more frequently than children in the 
control group. We counted the number of trials on which each 
child’s first look was to the upper left portion of the text. 
Because the data were very heavily skewed (most children did 
not look left), we opted to use a Kruskal-Wallis test instead of 
an ANCOVA. The Kruskal-Wallis test is the nonparametric al- 
ternative to the analysis of variance that is used when data are 
nonnormal (Rosner, 2011). It does not allow for the addition of a 
covariate, so we first checked for a correlation between age and the 
number of times children’s first look was on the left portion of text. 
There was no correlation between age and number of left looks for 
word slides or sentence slides. The Kruskal-Wallis test indicated 
that there were no group differences in the number of left looks for 
words (p = .652) or sentences (p = .838). 
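
As an illustration of the test named above, the sketch below computes the tie-corrected Kruskal-Wallis H statistic for two groups and its p value from the chi-square distribution with 1 degree of freedom. It is a textbook formulation, not the study's software, and the example counts of left first looks are hypothetical.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def kruskal_wallis_two_groups(group_a, group_b):
    """Return (H, p) with tie correction; undefined if every value is identical."""
    pooled = list(group_a) + list(group_b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:len(group_a)])
    rank_sum_b = sum(ranks[len(group_a):])
    h = (12 / (n * (n + 1))) * (rank_sum_a ** 2 / len(group_a)
                                + rank_sum_b ** 2 / len(group_b)) - 3 * (n + 1)
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    h = max(h / (1 - tie_term / (n ** 3 - n)), 0.0)  # guard against rounding below 0
    p_value = math.erfc(math.sqrt(h / 2))  # chi-square survival function, df = 1
    return h, p_value


# Hypothetical numbers of left first looks (0-6) per child in each group.
treatment_looks = [0, 0, 1, 0, 2, 0]
control_looks = [0, 1, 0, 0, 0, 0]
h_stat, p_val = kruskal_wallis_two_groups(treatment_looks, control_looks)
```

Because most children in such data score 0, ties dominate the ranking, which is why the tie correction matters here.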

Decoding. If children learned word-text mappings from Your 
Baby Can Read, we would expect children in the treatment group 


Descriptive Statistics for Full Alphabetic and Consolidated Alphabetic Measures 

                                                                Treatment       Control 
Measure                                                         M (SD)          M (SD)          pᵃ 
Target vocabulary knowledge (Proportion of target words 
  child says)                                                   .58 (.33)       .41 (.32)       [illegible] 
Directionality of print (First look to correct position) 
  Words 
    Mean number                                                 [illegible]     .24 (.60)       .652 
    Range (max = 6)                                             0-4 correct     0-3 correct 
  Sentences 
    Mean number                                                 .59 (.83)       [illegible]     .838 
    Range (max = 3)                                             0-3 correct     0-2 correct 
Decoding (Proportion of time spent looking to target) 
    Baseline                                                    .50 (.16)       .54 (.21)       .357 
    Test                                                        .47 (.20)       .49 (.24)       .744 
Expressive vocabulary knowledge (Percentile score on the 
  MacArthur CDI short form at final visit)                      46.8 (30.15)    40.3 (32.2)     .289 
Reading with meaning (No. of behaviors performed) 
    Familiar cues                                               4               0               .064 
    Novel cues                                                  2               1               [illegible] 

Note. MacArthur CDI short form = MacArthur-Bates Communicative Development Inventories-short form 
(Fenson et al., 2004). 
ᵃ p reported for analysis of covariance, Kruskal-Wallis, or Fisher’s exact comparisons of condition. 
[symbol illegible] p < .001, after controlling for age. 
to fixate significantly more on the target words than children in the 
control group. A one-way ANCOVA, controlling for age, indi- 
cated no difference in children’s preference for the target over the 
foil in the baseline phase (prior to being prompted where to look), 
F(1, 100) = 0.86, p = .357, n° = .008. Age was also nonsignif- 
icant, F(1, 100) = 0.17, p = .679, n? = .002. This established that 
children had no visual preference for one object over the other. 
Baseline looking proportion was then entered along with age asa 
covariate in a one-way ANCOVA to test for group differences in 
looking to the target during the test phase (after the prompt was 
given). There was no group difference, F(1, 97) = 0.11, p = .744, 
1 = .0O1. Age was a nonsignificant covariate, F(1, 97) = 0.41, 
Pp = .523, "7 = .004. Baseline looking was significant, F(1, 97) = 
4.70, p = .033, n° = .046. This indicated that children’s visual 
preference in the baseline was the best predictor of their looking 
during test. 

Program vocabulary. Toward the end of the study, we asked
parents in both groups to identify target words that their child
could say, using a format similar to that of the MacArthur-Bates
Communicative Development Inventories, only with target words
taken directly from the intervention program. Comparing treatment and
control groups, there was a significant main effect of condition on
children’s expressive knowledge of words introduced in the program,
F(1, 102) = 5.99, p = .016, η² = .055, after controlling for
age. According to parent reports, children in the treatment group
knew an average of 58% of the target words (SD = 33%), and the
control group knew only 41% (SD = 32%). Age was a significant
covariate, F(1, 102) = 39.46, p < .001, η² = .279.

Vocabulary knowledge. At the same time, we administered 
the MacArthur—Bates Communicative Development Inventories to 
both groups in this final phase of the study. Because the percentile 
ranking is based partially on children’s age, age was not used as a 
covariate in the analysis. Although children in the treatment group 
were reported to know more words as indicated by the higher mean 
percentile ranking (M = 46.8, SD = 30.15) than children in the 
control group (M = 40.3, SD = 32.2), this difference was
nonsignificant, F(1, 104) = 1.14, p = .289, η² = .011. Therefore, if
children in the treatment group were using more expressive vo- 
cabulary than those in the control group, it was likely due to their 
repeating the words that they heard and saw in the program and not 
due to a significant increase in vocabulary knowledge at large. 
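
The effect sizes reported alongside each F test appear consistent with partial eta squared, which can be recovered from the F statistic and its degrees of freedom as (F × df effect) / (F × df effect + df error). The sketch below is only an arithmetic check under that assumption; the function name and the tolerance are ours and are not part of the original analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, F, df_effect, df_error, effect size as reported in the text)
reported_tests = [
    ("Decoding baseline, condition", 0.86, 1, 100, 0.008),
    ("Decoding baseline, age", 0.17, 1, 100, 0.002),
    ("Decoding test, condition", 0.11, 1, 97, 0.001),
    ("Decoding test, age", 0.41, 1, 97, 0.004),
    ("Decoding test, baseline looking", 4.70, 1, 97, 0.046),
    ("Program vocabulary, condition", 5.99, 1, 102, 0.055),
    ("Program vocabulary, age", 39.46, 1, 102, 0.279),
    ("Vocabulary knowledge, condition", 1.14, 1, 104, 0.011),
]

# The F values are themselves rounded to two decimals, so allow a small
# tolerance when comparing against the reported effect sizes.
TOLERANCE = 0.001

for label, f_value, df1, df2, reported in reported_tests:
    computed = partial_eta_squared(f_value, df1, df2)
    agrees = abs(computed - reported) < TOLERANCE
    print(f"{label}: computed {computed:.4f}, reported {reported:.3f}, agrees={agrees}")
```

Even the one significant condition effect (program vocabulary) corresponds to a small share of variance compared with the age covariate in the same model.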

Reading with meaning. In our final analysis, we examined 
children’s ability to read with meaning using written cue cards. 
Three of these cue cards contained exact phrases and actions 
taught in the program (e.g., “Clap your hands”), and the other three 
cards contained words used in the program but combined in novel 
ways (e.g., “Touch your face”). Children’s responses were video- 
taped and coded by the experimenter for whether the child pro- 
duced the requested action. 

There were a total of four positive responses to the familiar cues,
all of which were produced by the treatment group. We performed
a Fisher’s exact test to determine whether the proportion of
responses in the treatment group was significantly greater than the
null performance of the control group. We used a one-tailed test
because the hypothesis was directional (i.e., babies’ being
able to read), suspecting that it would be rare and that the opposite
(i.e., babies’ failing to respond) might be more common. In other
words, we did not expect an extreme contingency table indicating
common positive (reading) responses. The one-tailed Fisher’s
exact significance was p = .064, indicating that there was a
nonsignificant association between positive responses to the program
cue cards and condition. For the novel cues, there were only
a total of three positive responses, two from infants in the treatment
group and the other from an infant in the control group. The
one-tailed Fisher’s exact significance was p = .746, indicating no
association between positive responses and treatment group.
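
To make the logic of the one-tailed exact test concrete, the following Python sketch computes the upper-tail probability from the hypergeometric distribution for a 2 × 2 table. It is an illustration rather than the code used in the study. The group sizes in the example (50 per group) are hypothetical, because exact cell counts for nonresponders are not given in this section, so the resulting value should not be expected to reproduce the reported p values.

```python
from math import comb


def fisher_one_tailed(a, b, c, d):
    """Upper-tail Fisher's exact p for the 2 x 2 table [[a, b], [c, d]].

    Rows are groups (treatment, control); columns are outcomes
    (responded, did not respond). Returns the probability of observing
    at least `a` responders in the first row when the margins are fixed.
    """
    row1 = a + b          # size of the first group
    col1 = a + c          # total number of responders
    n = a + b + c + d     # total sample size
    upper = min(row1, col1)
    total = comb(n, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return tail / total


# Hypothetical group sizes (50 per group): 4 of 50 treatment children
# and 0 of 50 control children respond to the familiar cues.
p_familiar = fisher_one_tailed(4, 46, 0, 50)
print(f"one-tailed p = {p_familiar:.3f}")
```

With only four positive responses in total, even the most extreme possible split (all four in one group) cannot yield a very small p value, which illustrates how little evidence such sparse counts can provide.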

We then asked each of the children who had performed at least 
one of the reading behaviors to return to the lab after 6 months; 
five of the seven children returned for a visit. We repeated the 
same set of procedures and prompts as in the initial reading with 
meaning task. None of the children successfully completed any of 
the behavioral responses on the written cue cards. 

Finally, we examined the exit questionnaire responses of parents 
of children who had appeared to “read” cards on the task versus the 
rest of the sample on two exit items: “My child has started to learn 
to read,” and “My child knows how to read.” Kruskal-Wallis tests 
indicated that these parents had given their children higher ratings 
on both items, ps = .021 and .005, respectively. 

In sum, following the use of the intervention over a 7-month
period, there was no evidence to indicate that babies in the treatment
group could read or attend to words and texts any differently
than children in the control group. Although parents in the treatment
group reported that their children knew significantly
more target words from the program than those in the control group,
these gains were not evident in the standardized vocabulary measure.
Finally, those children who appeared to read and respond to
written language cues seen on the DVDs were not able to identify
words or phrases after the intervention was completed, despite the
fact that their parents believed that they were beginning to read.


Discussion 


The last decade has seen an explosion of baby media aimed
at promoting infants’ development (Rideout & Hamel, 2006).
Among the best-selling products are those that claim to teach 
babies to read. In their claims, program developers have argued 
that by accelerating the reading process, their products can enable 
young children to advance more rapidly in school, reading com- 
plex texts which would otherwise be taught in later grades. These 
developers implicitly make the case that by gaining more knowl- 
edge through text, a child can begin to read earlier and that the 
earlier a child can begin to read, the more proficient he or she will 
become in school and beyond. 

To our knowledge, this is the first study to fully test these 
claims. Purposely, we set out to be most scrupulous in our defi- 
nition of reading, examining outcomes that not only included 
conventional reading but ones that could tap the earlier precursors 
of literacy skills. To avoid the limitations of previous studies, we 
followed the program developers’ detailed protocol, trained par- 
ents in how to use these products, and conducted ongoing fidelity 
checks to measure and assess dosage. We used outcome measures 
that included parent reports and a series of eye-tracking tasks that 
could more sensitively gauge the subtleties of early orthographic 
and phonological knowledge. Finally, we conducted what is re- 
garded as the “gold standard” in research design, randomly assign- 
ing children and their families to treatment and control groups. 

Our results indicated that babies did not learn to read. In total, 
out of 14 different measures of early reading skills, there were 13
null findings. We saw no evidence of the effects on conventional
reading that program developers had claimed on their promotional
websites and testimonials, or of effects on any of the pre-alphabetic
or partial alphabetic phases of reading. Even with a greater dosage of
treatment than in previous studies, there were no effects of the
intervention on children’s speech processing efficiency, word learning
skills, phonological processing, orthographic knowledge, letter 
recognition, sight word reading, or reading with meaning. 

Nevertheless, based on our exit interviews and parent reports 
of target word learning (i.e., words seen on the DVD), there was 
the belief among parents that their babies were learning to read
and that their children had benefited from the program in their 
expressive vocabulary development. Parents in the intervention 
reported that their infants used more of the targeted words from 
the program than those in the control group. This did not 
generalize, however, to the standardized measure of expressive 
word knowledge. In this case, there were no reported differ- 
ences between groups. 

These results suggest that parents may have interpreted imitation
and mimicking as an indicator of word learning. A plethora of
studies (e.g., Bandura’s classic Bobo doll experiment; Bandura,
1965; Neuman, 1991) have shown children’s ability
to mimic what they see on the screen. This mimicking phenomenon
was clearly evident in several of the children’s responses to
written commands on cue cards. Although four children initially 
responded to a phrase prompt directly from the program immedi- 
ately after the intervention, none were successful a few months 
later. Further, none of the children could respond to words in novel 
phrases. Consequently, testimonials and videos of reading on 
many of these websites may be preying on parents’ wishes and 
beliefs for their child’s precocity and not on the reality of what 
constitutes meaningful learning. Although we cannot say with full 
assurance that infants at this age cannot learn printed words, we 
can confidently say that they did not learn printed words from a 
product of this nature. 

Our findings provide further support for experimental studies 
that have demonstrated a lack of significant effects of baby media 
on receptive language. In two previous studies, for example, nei- 
ther Robb et al. (2009) nor DeLoache et al. (2010) reported a 
significant relationship between DVD viewing and infant receptive 
vocabulary. These results stand in contrast to those of Vandewater 
(2011) who reported significant positive effects on receptive vo- 
cabulary growth using a similar program. Comparing her results 
with these previous studies, Vandewater argued that the differ- 
ences in these findings could potentially be attributed to sample 
size or a sleeper effect in which effects were found 3 months after
the intervention. However, in our study, children were likely to be 
exposed to a far greater dosage than in Vandewater’s study (i.e., 
children in her study saw the video an average of 14 times, 
whereas ours were instructed to view the initial DVD at least 60 
times, with further redundancy built into all of the program mate- 
rials and repeated throughout the entire five-volume program) with 
no evident immediate or sleeper effects. Given that Vandewater’s
results relied on parental report of children’s language gains and 
not direct assessments with children, such findings might reflect a 
social desirability bias or wishful thinking on the part of parents. 
DeLoache et al. (2010), who reported both parent opinion and 
word learning directly assessed with the child, found that parents
who enjoyed the DVD overestimated how many words
their children learned from it. Although there is some research
to suggest that young children are capable of learning individual
words from screen media sources like video (Allen &
Scofield, 2010; Krcmar, 2010; Krcmar et al., 2007), such media may be
a poor substitute for language experiences with live speakers
(Golinkoff & Hirsh-Pasek, 1999).

At the same time, we did not find evidence for suppression of 
language scores, as reported in surveys and correlational studies.
Zimmerman, Christakis, and Meltzoff (2007), for example,
reported that watching baby videos predicted significant decrements
in vocabulary size for infants between 8 and 16 months.
In contrast, we found no declines in vocabulary for either group 
throughout our study. Similarly, Linebarger and Vaala (2010) 
have proposed that the expository content in baby videos (such as
the program intervention in this study), compared with
narrative or story-like content, may impair language develop- 
ment due to the volume of information presented in these 
programs. Once again, we found no evidence of decline in 
language or any other skill as a result of the intervention. 
Finally, media researchers have suggested that certain features 
like hearing verbal labels for objects visually depicted on- 
screen, verbal and visual emphasis on novel words and their 
referents, and repetition of verbalizations might support increased
language acquisition (Krcmar, 2010; Naigles & Mayeux,
2001). However, even with repeated exposure (i.e., 20
words × 30 days × 4 types of media), we found little support
for language acquisition or skill development from baby media. 

Rather, an alternative hypothesis is that babies are neither 
helped nor harmed by baby media. Clearly, infants can make sense 
of some video displays. For example, we could not have conducted 
our eye-tracking studies without some sustained attention to screen 
media. Moreover, studies comparing video instruction with no 
instruction at all have consistently shown that infants do learn 
some information from videos (Barr & Hayne, 1999; Strouse & 
Troseth, 2008). Further, there are a number of impressive studies 
suggesting that older infants and toddlers, given the right type of 
experiences, can bring their perceptual skills and general knowledge
backgrounds to bear on novel problem-solving tasks presented
through screen media (Nielsen, Simcock, & Jenkins, 2008;
Troseth, Saylor, & Archer, 2006). Nevertheless, the absence of any 
significant effects on children’s language and skills suggests to us 
that the developmentalists are most accurate in their recognition of 
infant capabilities and limitations. Infants do bring a limited set of 
experiences and little background knowledge of the content and 
format used to deliver information on screen. As a result, they are 
in a poor position to use screen media as a tool for learning how to
read, given that language development and background knowledge 
are required for reading performance even at the initial level 
(Neuman, 2006). 

Ultimately, therefore, it is about choice. Parents must weigh 
whether such exposure to media is displacing other activities 
(Neuman, 1991), such as adult—child language interaction, reading 
books, play, and joint activity. These are the activities that a large,
convergent body of research (Neuman & Celano, 2012;
Snow, Burns, & Griffin, 1998) has shown to benefit
children’s affect, cognitive development, early reading skills,
and, in the long run, reading performance.


Does Cognitive Strategy Training on Word Problems Compensate for
Working Memory Capacity in Children With Math Difficulties? 


H. Lee Swanson 
University of California, Riverside 


Cognitive strategies are important tools for children with math difficulties (MD) in learning to solve word 
problems. The effectiveness of strategy training, however, depends on working memory capacity 
(WMC). Thus, children with MD but with relatively higher WMC are more likely to benefit from strategy 
training, whereas children with lower WMC may have their resources overtaxed. Children in Grade 3 
(N = 147) were randomly assigned to 1 of 4 conditions: (a) verbal strategies (e.g., underlining question
sentence), (b) visual strategies (e.g., correctly placing numbers in diagrams), (c) verbal plus visual 
strategies, or (d) an untreated control. In line with the predictions, children with MD and higher WMC 
benefited from verbal or visual strategies relative to those in the control condition on posttest measures 
of problem solving, calculation, and operation span. In contrast, cognitive strategies decreased problem- 
solving accuracy in children with low WMC. Thus, improvement in problem solving and related 
measures, as well as the impairment in learning outcomes, was moderated by WMC. 


Keywords: math disabilities, strategy training, working memory 


The majority of the research on children who experience math 
difficulties (MD) has focused on processes related to calculation 
(Andersson, 2010; Geary, 2003, 2010; Gersten et al., 2009; Maz- 
zocco, Devlin, & McKenney, 2008; Stock, Desoete, & Roeyers, 
2010; Swanson & Jerman, 2006). More recent studies, however, 
have focused on children who experience difficulties solving word 
problems (e.g., Andersson, 2010; Fuchs, Zumeta, et al., 2010; 
Stock et al., 2010; Swanson, Jerman, & Zheng, 2008). This is an 
important focus because word problem solving constitutes one of 
the most critical mediums through which children can learn to 
select and apply strategies for coping with everyday problems. In 
addition, recent studies have shown that the cognitive processes 
involved in calculation difficulties are not the same processes as 
those involved in problem-solving difficulties (e.g., Fuchs et al., 
2008) and therefore call for unique interventions. In addition, some 
studies have shown that deficits in word problem solving are 
persistent across the elementary school years even when calculation
and reading skills are at grade level (e.g., Swanson et al.,
2008).

Recent intervention studies directed to improve problem-solving 
accuracy in children with MD have found support for teaching 
cognitive strategies. Several studies have found that verbal strategy 
instruction (e.g., Montague, 2008; Montague, Warger, & Morgan, 
2000; Xin, 2008), as well as visual-spatial strategies (e.g., Koll- 
offel, Eysink, de Jong, & Wilhelm, 2009; van Garderen, 2007), 
enhance children’s math performance relative to control conditions 
(see Baker, Gersten, & Lee, 2002; Gersten et al., 2009, for reviews).
Several well-designed intervention studies (randomized
clinical trials) have focused on high-risk samples. For example, 
Jitendra et al. (1998) used a visual categorization method to cluster 
arithmetic word problems (e.g., change, compare) that signifi- 
cantly improved problem-solving accuracy compared to the con- 
trol condition (effect size 0.45). Likewise, in a randomized control 
group design, Fuchs et al. (2003) taught problem-solving methods 
to children with MD and found that cognitive strategies (schema- 
based instructions) improved problem-solving accuracy (effect 
sizes ranged from 1.16 to 1.18 depending on the transfer measure). 
Additional successful strategy models have included diagramming 
(van Garderen, 2007), identification of key words (e.g., Mas- 
tropieri, Scruggs, & Shiah, 1997), and meta-cognitive strategies 
(e.g., Montague, 2008; see Gersten et al., 2009; Xin & Jitendra, 
1999, for reviews). These studies strongly suggest that the training 
of cognitive strategies facilitates problem-solving accuracy in chil- 
dren with MD. 

Despite the overall benefits of strategy instruction in remediating
word-problem-solving difficulties, the use of strategies
for some children with MD may not always be advantageous. 
From an aptitude-treatment perspective, not all children with MD 
may be expected to benefit from strategy training. In this study, I 
hypothesize that the availability of ample working memory resources
is an important precondition in determining whether strategy
training will be successful. This is because strategies are
resource demanding. As a consequence, children with relatively 
smaller working memory capacities (WMC) may be easily over- 
taxed by certain strategies, which may even lead to poor learning 
outcomes after training. This is because word problem solving is 
an activity that draws upon WMC to a considerable degree. Be- 
cause children with MD experience working memory difficulties 
(e.g., Swanson & Beebe-Frankenberger, 2004), children with low 
WMC may be unable to effectively benefit from cognitive strategy 
interventions. In contrast, children with MD who meet a certain 
threshold of (yet to be determined) WMC would have spare 
working memory resources to benefit from cognitive strategies. 
This hypothesis is in line with cognitive load theory (e.g., Sweller, 
1988, 2005), whose central tenet is that instruction should be 
designed in alignment with the learners’ cognitive architecture, 
which consists of a limited-capacity working memory system. 
Because information has to pass through working memory before 
it can be consolidated into long-term memory, the limited capacity 
of working memory can be considered the bottleneck for learning. 
Thus, individuals with MD but relatively higher WMC are better 
able than children with lower WMC to utilize cognitive strategies. 
This is because strategies rely on declarative representations and 
serial cognitive processes that require large amounts of WMC 
(e.g., Anderson, 1987), and the utilization of cognitive strategies 
that have been recently acquired imposes demands on WMC. In 
the context of this study, I define working memory as a processing 
resource of limited capacity that is involved in the preservation of 
information while simultaneously processing the same or other 
information (e.g., Baddeley & Logie, 1999; Engle, Tuholski, 
Laughlin, & Conway, 1999).

Although the above hypothesis is plausible, there are at least 
three alternative possibilities to explain the role of WMC and the 
utilization and training of cognitive strategies to enhance problem- 
solving accuracy in children with MD. First, individual differences 
in WMC may not moderate the use of cognitive strategies in 
children with problem-solving difficulties. Support for this posi- 
tion comes from studies showing that WMC is not predictive of 
problem-solving accuracy when basic skills, such as reading, are 
entered into the regression model (e.g., Lee, Ng, Ng, & Lim, 2004; 
Swanson, Cooney, & Brock, 1993). Further, it could be argued that 
strategy training is primarily directed at providing additional cues 
to facilitate the retrieval of computational information via com- 
prehension, and therefore word problem solving does not interact 
with WMC. Thus, WMC would not moderate the impact of strat- 
egy training on problem solving because the impact of strategy use 
during problem solving is through a route unaffected by WMC. In 
the second alternative, WMC operates as a general system that 
subsumes many higher and lower order processes related to word 
problem solving accuracy. That is, processes related to word
problem solving (computation and comprehension) share resources
with working memory (e.g., Colom, Abad, Quiroga, Shih,
& Flores-Mendoza, 2008; Engle, Cantor, & Carullo, 1992). For 
example, WMC is necessary in the solving of word problems to 
allow for (or provide resources for) the translation of numbers and 
text information into algorithms, as well as for the simultaneous 
storage of output from previous processing. Thus, WMC predicts 
problem-solving performance but does not directly interact with 
cognitive strategies in facilitating problem-solving performance. 
In other words, there is no direct moderation of strategy effects by WMC.


In the final alternative, cognitive strategies compensate for the 
excessive processing demands placed on WMC due to the extra- 
neous load of the problem-solving task. For example, solving a 
word problem (e.g., “15 dolls are for sale, 7 dolls have hats. The
dolls are large. How many dolls do not have hats?”) involves a
variety of mental activities. Children must access prestored infor- 
mation (e.g., 15 dolls), access the appropriate algorithm (15 minus 
7), and apply problem-solving processes to control its execution 
(e.g., ignore the irrelevant information). The multistep nature of 
word problems that requires the processing of both relevant and 
irrelevant propositions draws on WMC. Children with relatively 
low WMC prior to training may be more responsive to cognitive 
strategies because such strategies help them compensate for work- 
ing memory limitations. In contrast, children with relatively higher 
levels of WMC may experience a level of redundancy or unnec- 
essary processing related to strategy training that does not facilitate 
learning. Thus, this alternative hypothesis predicts that WMC
moderates strategy outcomes, but the effects differ from those of the
first hypothesis. The first hypothesis predicts that strategy effects
are greatest for children high in WMC, whereas the latter hypothesis
predicts that strategy training is more effective for children
low in WMC.

This study investigates the role of WMC in strategy training for 
children with MD, by comparing three cognitive interventions to 
boost word problem solving performance. Training provided explicit
instruction related to (a) verbal strategies that directed children
to identify (e.g., via underlining and circling) relevant or key
propositions within the problems, (b) visual strategies that required 
children to place numbers into diagrams, or (c) a strategy condition 
that combined verbal and visual strategies. Also, because warm-up 
activities related to calculation have been found to be effective in 
problem-solving interventions, this component was also included 
in all strategy training sessions (e.g., Fuchs et al., 2003). In 
addition, consistent with literature reviews that have identified key 
components related to treatment effectiveness (Gersten et al., 
2009; Xin & Jitendra, 1999), each strategy training session in- 
volved explicit practice and feedback related to strategy use and 
performance. The strategy conditions also directed children’s at- 
tention to the relevant propositions within each word problem 
(Mayer & Hegarty, 1996). Additionally, embedded within each
lesson were instructions to focus on the information relevant to
solution accuracy as the number of distracting, irrelevant
propositions within the word problems increased. This is an
important component because difficulties
in controlled attention have been found to underlie some of the
cognitive deficits experienced by children with MD (e.g., Passol- 
unghi, Cornoldi, & De Liberto, 2001; Passolunghi & Siegel, 2001). 

In summary, this study addresses the question, What role does 
working memory capacity (WMC) play in strategy training out- 
comes for children with MD? Four prediction models based on the 
aforementioned hypotheses can be applied to strategy training 
outcomes for children with MD: (a) WMC as a limiting factor, (b) 
basic skills, (c) general resource, and (d) compensatory. The hy- 
pothesis that WMC serves as a limiting factor suggests that chil- 
dren with lower WMC benefit less from strategy conditions than 
children with relatively higher WMC. Thus, children with MD 
vary in their responsiveness to strategy instruction, and this is 
predicated on their WMC. In contrast, the basic skills model 
suggests that if declarative knowledge is intact (i.e., reading comprehension
and computational knowledge are in the average
range), strategy instruction provides a helpful procedure to solve 
word problems without making demands on WMC. This model 
suggests that cognitive instruction provides additional information 
over control conditions when basic skills (e.g., calculation, read- 
ing) are intact. For example, children with MD benefit from 
strategy instruction because they are less efficient than
average-achieving children in calculation and general problem solving.
Thus, strategy instruction interacts with general math ability and 
not WMC. In contrast, the general resource model hypothesis 
predicts that individual differences in WMC are related to solution 
accuracy regardless of treatment conditions. The resource model 
predicts that because WM as a general system underlies several 
problem-solving tasks, WMC has a general effect (nontreatment- 
specific effect) on problem-solving outcomes. Finally, the com- 
pensatory model suggests that WM interacts with treatment out- 
comes. Cognitive training is viewed as reducing processing 
demands on children’s problem solving and therefore freeing 
additional resources to solve problems. The compensatory model 
predicts that children with low WMC are more likely than those 
with relatively higher WMC to place a greater reliance on strategy 
conditions. 

The first hypothesis predicts an interaction in favor of the high 
WMC group; the second predicts no significant involvement of 
WMC in strategy outcomes (no significant main effect or interac- 
tion); the third predicts a main effect for WMC but no interaction 
with strategy conditions; and the fourth predicts a significant 
WMC by cognitive strategy interaction in favor of children with 
low WMC. 
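
The four predictions can be summarized as a simple decision rule over two results: whether WMC shows a main effect and whether the WMC × strategy interaction is significant (and, if so, which group it favors). The Python sketch below is only an illustration of that logic; the function and argument names are hypothetical and are not part of the analyses reported in this study.

```python
def supported_model(wmc_main_effect, interaction, favors_high_wmc=None):
    """Map a pattern of results onto one of the four prediction models.

    wmc_main_effect: True if WMC has a significant main effect.
    interaction: True if the WMC x strategy interaction is significant.
    favors_high_wmc: direction of the interaction (required if interaction
        is True): True if strategy effects favor children with higher WMC,
        False if they favor children with low WMC.
    """
    if interaction:
        if favors_high_wmc is None:
            raise ValueError("direction of the interaction must be given")
        # Hypotheses 1 and 4 both predict moderation, in opposite directions.
        return "WMC as a limiting factor" if favors_high_wmc else "compensatory"
    # Hypotheses 2 and 3 predict no moderation; they differ on the main effect.
    return "general resource" if wmc_main_effect else "basic skills"


print(supported_model(wmc_main_effect=True, interaction=True, favors_high_wmc=True))
print(supported_model(wmc_main_effect=True, interaction=False))
```

Framing the hypotheses this way makes explicit that the limiting-factor and compensatory models can be distinguished only by the direction of the interaction, not by its presence.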


Method 


Participants 


Participants were 147 third-grade children from public school 
classrooms in the southwestern United States. The final selection 
was based on parent approval for participation and achievement 
scores.¹ Of the 147 children selected, 74 were female and 73 were
male. Ethnic representation of the sample was 83 Anglo, 30 
Hispanic, 13 African American, 8 Asian, and 13 mixed and/or 
other (e.g., Anglo and Hispanic, Native American). The
socioeconomic status (SES) of the sample was primarily low
to middle, based on free and reduced lunch participation,
parent education, and parent occupation. The schools provided the
percentage of children within each classroom on a free-lunch program
but not for individual participants. Significant differences occurred
across classrooms (N = 22) in terms of the percentage of free-lunch
representation (percentages varied from 2% to 56% of the
classes), χ²(9, N = 22) = 73.62, p < .01. However, because
children were randomly assigned to treatments within classrooms,
I assumed that SES was not a contributing factor to the treatment
outcomes. Based on the school records, the sample was drawn
randomly from classrooms that reflected low middle class to upper
middle class.

Definition of risk for math difficulties (MD). This study 
sought to identify children at risk for difficulties in problem- 
solving performance. There is some consensus among researchers 
that it is more appropriate to use a cutoff score on achievement to 
determine risk factors in math rather than a discrepancy between
achievement and IQ. Therefore, this study uses a cutoff score on
standardized math achievement tests. Because the majority of 
children were not diagnosed with specific learning disabilities in 
math, I utilized the term “at risk for math difficulties” to indicate 
math difficulty (MD). Because this study’s focus was on word 
problem solving difficulties, I examined children who performed 
at or below the 25th percentile on norm-referenced word problem
solving math tests over a 2-year period. The 25th percentile cutoff 
score on standardized achievement measures has been commonly 
used to identify children at risk (e.g., Fletcher et al., 1989; Siegel 
& Ryan, 1989). This procedure separated the sample into 59 
children with MD (25 female, 34 male) and 88 children without 
MD (50 female, 38 male). I chose to focus this intervention on 
children with MD at the third-grade level because this is when 
word problems are emphasized within the curriculum relative to 
the early grades. 

The cutoff criteria for defining children at risk for MD were a
score between the 35th and 90th percentile on measures of fluid
intelligence (Raven Colored Progressive Matrices Test; Raven,
1976), reading (Test of Reading Comprehension, Word Identification
subtest from the Wide Range Achievement Test [WRAT-3;
Wilkinson, 1993]), and calculation (subtests from the WRAT-3 and
Wechsler Individual Achievement Test; Psychological Corporation,
1992), in addition to a composite score at or below the 25th
percentile (below a standard score of 90 or scale score of 8) on
standardized word problem solving math tests. Children were
considered at risk if they performed at or below the 25th percentile 
on two of the problem-solving subtests. The story problem subtests 
were taken from the Test of Math Ability (TOMA; Brown, Cronin, 
& McEntire, 1994) and KeyMath (Connolly, 1998). Table 1 shows 
the means and standard deviations for children with MD and 
average achievers. As shown in Table 1, performance on standard- 
ized measures of word problem solving accuracy for the MD 
sample was at or below the 25th percentile (scale score at or below 
8, standard score below 90), whereas the MD sample’s norm- 


¹ The sample was selected from two charter schools as part of a large
Board of Cooperative Educational Services (BOCES). The charter schools
serve a large number of children with learning disabilities and act as a
major clinical site for the special education local plan area (SELPA) that
has a higher than average population of children with special needs (20% 
of the school population). This allowed selection of participants within 
each classroom to be randomly assigned to each treatment condition. 
Although 210 children participated in the study, final selection of the 
at-risk sample was further refined to children with 2 years of low problem- 
solving scores on district-wide tests but reading scores in the average 
range. Additionally, children whose total composite problem-solving 
scores (TOMA, KeyMath) were at borderline (i.e., 26th percentile) as 
being classified at risk for math difficulties (to be discussed) were excluded 
from the data analysis. The labels assigned
to the three strategy conditions were primarily derived from what was
emphasized (e.g., diagrams were utilized in the visual condition on the
assumption they were creating an external problem presentation);
elements of both verbal and visual
information occur in all the conditions. The verbal treatment conditions
drew strategy steps and activities designed to cue attention based on the 
work of Montague, Warger, and Morgan (2000); Fuchs et al. (2004), and 
Jitendra et al. (1998), whereas the visual-spatial intervention drew upon the 
work of van Garderen (2007) and related studies using diagrams from the 
Singapore curriculum (e.g., Kolloffel et al., 2009; Looi & Lim, 2009; Ng 
& Lee, 2009). 
Table 1

Classification and Pretest Scores for Children With MD and Average Achievers

                                  Children with MD                  Average achievers
Measure             Reliability   N     M            SD             N     M            SD             F ratio
Age                               59    8.79         0.75           88    8.78         0.50           0.01
Classification
  TOMA_S            0.87          59    6.17         1.06           88    9.67         2.06           142.68
  KeyM_S            0.90          59    6.66         1.45           88    10.85        [illegible]    90.63
  Average           0.89          59    6.64         1.09           88    11.16        1.56           371.44
Fluid intelligence
  Raven_S           0.91          51    97.81        12.83          84    107.61       [illegible]    21.54
Reading
  TORC_S            0.98          54    [illegible]  [illegible]    84    11.42        [illegible]    28.16
  WRAT_S            0.81          58    98.55        [illegible]    88    110.5        11.67          42.30
Arithmetic
  WIAT_S            0.86          58    94.91        10.65          87    104.44       9.82           30.58
Working memory
  Concept           0.87          59    2.93         2.06           88    [illegible]  5.61           35.24
  Sent/Dig          0.86          59    4.79         3.45           88    9.13         5.39           29.47
  Update            0.84          59    3.68         2.49           88    9.02         4.29           74.61
  Composite         0.85          59    −0.49        0.35           88    0.61         0.62           153.61
Pretest
  Problem solving   0.92          59    5.05         2.26           88    9.69         2.62           123.24
  Calculation       0.93          59    24.02        2.66           88    26.48        [illegible]    19.10
  Operation span    0.87          58    3.84         3.42           88    5.09         4.67           3.05





Note. _S at the end refers to standard or scale score. MD = math difficulties; TOMA = Test of Math Ability; KeyM = KeyMath test; Raven = Raven 
Colored Matrices Test; TORC = Test of Reading Comprehension; WRAT = Wide Range Achievement Test; WIAT = Wechsler Individual Achievement 
Test; Concept = conceptual span; Sent/Dig = sentence/digit span; Update = updating measure; Problem solving = word problems solving subtest from 
the Comprehensive Test of Math Abilities (CMAT); Calculation = arithmetic calculation subtest from the WRAT-3. 
referenced scores on calculation, reading comprehension, and fluid
intelligence were above the 35th percentile. 
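
The classification rule described above can be sketched in code. This is a minimal illustration under stated assumptions, not the study's actual procedure: it assumes every score has already been converted to a percentile rank, collapses the reading measures into a single percentile, and requires both word-problem subtests (TOMA and KeyMath) to fall at or below the 25th percentile. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChildScores:
    """Percentile ranks for one child (hypothetical container)."""
    raven_pct: float        # fluid intelligence
    reading_pct: float      # reading (assumed single summary percentile)
    calculation_pct: float  # arithmetic calculation
    toma_pct: float         # TOMA word-problem subtest
    keymath_pct: float      # KeyMath word-problem subtest


def in_normal_range(pct: float, low: float = 35.0, high: float = 90.0) -> bool:
    """True when a percentile falls between the 35th and 90th percentile."""
    return low <= pct <= high


def is_at_risk_md(scores: ChildScores, cutoff: float = 25.0) -> bool:
    """Apply the at-risk rule: normal-range ability, low word-problem scores."""
    normal_ability = all(
        in_normal_range(p)
        for p in (scores.raven_pct, scores.reading_pct, scores.calculation_pct)
    )
    low_problem_solving = (
        scores.toma_pct <= cutoff and scores.keymath_pct <= cutoff
    )
    return normal_ability and low_problem_solving
```

A child at the 26th percentile on either word-problem subtest is not classified as at risk under this sketch, which mirrors the exclusion of borderline cases described in the footnote.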


Design and Treatment Conditions 


Random assignment. Children were randomly assigned at 
the individual level within each classroom either to a control group 
(N = 39) or to one of three treatment conditions: verbal strategies 
(N = 37), verbal + visual strategies (diagramming; N = 35), and 
visual-strategies-only (diagramming; N = 36).² Although the participating
children were randomly assigned to each of the different
strategy conditions within classrooms, a number of other controls 
were built into the implementation of the intervention. For exam- 
ple, to control for the impact of the graduate student tutors who 
implemented the interventions, all tutors were randomly rotated 
across days of the week and across treatment conditions, so that no 
one intervention group received instruction from the same gradu- 
ate tutor each time (i.e., tutor 1 might have presented Strategy A in 
the morning time slot on Monday, but then tutor 2 presented the 
next Strategy A lesson to the same children during that time slot on 
Wednesday). When comparing demographics of the children ran- 
domly assigned to one of the four treatment conditions (verbal- 
only, verbal + visual, visual-only, control), no significant differences
emerged between conditions as a function of MD status,
χ²(3, N = 147) = 1.98, p > .05; gender, χ²(3, N = 147) = 1.14,
p > .10; or chronological age, F(3, 147) = 1.47, p > .05.
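
Random assignment at the individual level within classrooms can be illustrated as follows. The text does not describe the mechanism that was used, so this sketch assumes a simple blocked scheme: children are shuffled within each classroom and dealt to the four conditions in a randomly ordered rotation. The condition labels, function name, and seed are hypothetical.

```python
import random
from collections import defaultdict

CONDITIONS = ["control", "verbal", "verbal+visual", "visual"]


def assign_within_classrooms(children, seed=0):
    """Assign each child to a condition, randomizing within classroom.

    children: list of (child_id, classroom_id) pairs.
    Returns a dict mapping child_id to a condition label.
    """
    rng = random.Random(seed)  # seeded so the assignment is reproducible

    # Group children by classroom.
    by_classroom = defaultdict(list)
    for child_id, classroom_id in children:
        by_classroom[classroom_id].append(child_id)

    assignment = {}
    for ids in by_classroom.values():
        rng.shuffle(ids)           # random order of children
        order = CONDITIONS[:]
        rng.shuffle(order)         # random order of conditions
        for position, child_id in enumerate(ids):
            # Deal children to conditions in rotation so that group sizes
            # within a classroom differ by at most one.
            assignment[child_id] = order[position % len(order)]
    return assignment
```

Blocking by classroom in this way keeps every condition represented in every classroom, which is the property the design relies on when classroom is treated as a random effect.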

Common instructional conditions. All the participants inter- 
acted with their peers in their homerooms on tasks and activities 
related to the district-wide math school curriculum. The schoolwide
instruction across conditions was the enVisionMATH Learning
Curriculum (Pearson Publishers, 2009). The curriculum included
visual representations to show how quantities of a word
problem were related and general problem-solving steps. The 
general problem-solving steps in the teacher manual instructed 
teachers to have children (a) understand, (b) plan, (c) solve, and (d) 
look back. An independent evaluation (Resendes & Azin, 2008),
conducted as a randomized trial (teachers assigned randomly to
treatment or control condition) under What Works Clearinghouse
(2006) standards, indicated gains in Grades 2 to 4, with effect sizes
relative to the control condition in the 0.20 range. A number of the
curriculum’s elements were also
utilized in this study’s treatments (e.g., find the key word). How- 
ever, in contrast to the school district’s required instruction, this 
study’s treatment conditions directly focused on specific compo- 
nents of problem solving over consecutive sessions presented in a 
predetermined order. In addition, the lesson plans for the experi- 
mental condition focused directly on the propositional structure of 
word problems. 

Experimental conditions. Each experimental treatment con- 
dition included 20 scripted lessons administered over 8 weeks. 
Each lesson was 30 minutes in duration and was administered 
three times a week in small groups of four to five children. Lesson 
administration was done by trained tutors (doctoral-level graduate 
students and/or master’s-level research assistants). Children were 


² The uneven sample size reflects some small attrition in the sample as
well as the removal of children not meeting the operational criteria (e.g., 
low reading scores) from the data analysis for defining the sample as at risk 
for MD. 
presented with individual booklets at the beginning of the lesson,
and all responses were recorded in the booklet. Each lesson within 
the booklet consisted of four phases: (a) warm-up, (b) strategy 
instruction, (c) guided practice, and (d) independent practice. 

The warm-up phase included two parts: calculation of problems
that required participants to provide the missing numbers (9 + 2 =
x; x + 1 = 6; x − 5 = 1), and a set of puzzles based on problems
using geometric shapes. This activity took approximately three to 
five minutes to complete. 

The instruction phase lasted approximately five minutes. At the 
beginning of each lesson, the strategies and/or rule cards were 
either read to the children (e.g., “to find the whole, you need to add 
the parts”) or reviewed. Depending on the treatment condition, 
children were taught the instructional intervention (verbal strategy, 
diagramming, or verbal strategy + diagramming). The steps for 
the verbal-strategy-only approach included (a) find the question 
and underline it, (b) circle the numbers, (c) put a square around the 
key word, (d) cross out information not needed, (e) decide on what 
needs to be done (add/subtract/or both), and (f) solve it. For the 
visual-strategy-only condition (diagramming) children were taught 
how to use two types of diagrams. The first one represented how 
parts made up a whole. The second type of diagram represented 
how quantities are compared. The diagram consisted of two empty 
boxes, one bigger and the other smaller, in which children were to 
fill in the correct numbers representing the quantities. An equation 
with a question mark was presented. The question mark acted as a 
placeholder for the missing number provided in the box. Finally, 
for the combined verbal + visual (diagramming) strategy condi- 
tion, an additional step (diagramming) was added to the six verbal 
strategy steps described above. This step included directing chil- 
dren to fill in the diagram with given numbers and identify the 
missing numbers (question) in the corresponding slots in the 
boxes. 

The third phase, guided practice, lasted 10 minutes and involved 
children working on three practice word problems. Tutor feedback 
was provided on the application of steps and strategies to each of 
these three problems. In this phase, children also reviewed exam- 
ple problems from the instructional phase. The tutor assisted 
children with finding the correct operation, identifying the key 
words, and providing corrective feedback on the solution. 

The fourth phase, independent practice, lasted 10 minutes and 
required children to independently (without feedback) answer an- 
other set of three word problems. If the child finished the inde- 
pendent practice tasks before the 10 minutes were over, he or she 
was presented with a puzzle to complete. The child’s responses 
were recorded for each session to assess the application of the inter- 
vention and problem-solving accuracy. For the visual-strategy-only 
condition, points were recorded for choosing the correct diagram, 
filling in the numbers correctly for the diagram, identifying the 
correct operations, and solving the problem correctly. For the 
verbal + visual strategy condition, points were recorded for choos- 
ing the correct diagram, inserting correct numbers, applying strat- 
egies, identifying the correct operations, and solving the problem 
correctly. For the verbal-strategy-only condition, points were re- 
corded for identifying the correct numbers, applying strategies 
(e.g., underlining), identifying the correct operations, and solving 
the problem accurately. 

Sentence demands. Word problems for each independent 
practice session included three parts: question sentences, number 


sentences, and irrelevant sentences. For each problem in the inde- 
pendent practice session, at least two number sentences were 
relevant to problem solution and one sentence served as the ques- 
tion sentence. The number of sentences, however, gradually in- 
creased across the training sessions. The numbers of sentences 
were as follows: Lessons 1 through 7 focused on identifying 
critical information for word problems four sentences long with 
one irrelevant sentence, Lessons 8 and 9 focused on five-sentence- 
long word problems with two irrelevant sentences, Lessons 10 
through 15 focused on six-sentence-long word problems with three 
irrelevant sentences, Lessons 16 and 17 focused on seven-
sentence-long word problems with four irrelevant sentences, and
Lessons 18 through 20 focused on eight-sentence-long word prob- 
lems with five irrelevant sentences. 
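
The progression of sentence demands across lessons can be written as a small lookup. The sketch below simply encodes the schedule stated in the paragraph above; the function name and return format are hypothetical.

```python
from typing import Tuple

# (lessons, total sentences, irrelevant sentences), as stated in the text.
SENTENCE_SCHEDULE = [
    (range(1, 8), 4, 1),    # Lessons 1-7
    (range(8, 10), 5, 2),   # Lessons 8-9
    (range(10, 16), 6, 3),  # Lessons 10-15
    (range(16, 18), 7, 4),  # Lessons 16-17
    (range(18, 21), 8, 5),  # Lessons 18-20
]


def sentence_demands(lesson: int) -> Tuple[int, int]:
    """Return (total sentences, irrelevant sentences) for a lesson number."""
    for lessons, total, irrelevant in SENTENCE_SCHEDULE:
        if lesson in lessons:
            return total, irrelevant
    raise ValueError("lesson must be between 1 and 20")
```

In every lesson the difference between the two values is three, corresponding to two number sentences plus one question sentence; only the number of irrelevant sentences grows.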

Treatment fidelity. Independent evaluations were adminis- 
tered in order to determine treatment fidelity. During all lesson 
sessions, tutors were randomly evaluated by two independent 
observers (a postdoctoral student, a nontutoring graduate student, 
and/or the project director). The observers independently filled out 
evaluation forms covering all segments of the lesson intervention. 
Points were recorded on the accuracy with which the tutor imple- 
mented the instructional sequence based on a rubric. Observations 
of each tutor occurred for six sessions randomly distributed across 
instructional sessions. Interrater agreement was calculated on all 
observation categories. The mean percentage of interrater agree- 
ment across all sequences and conditions for each step of strategy 
implementation (10 observable items were coded) was 98% (SD = 
.41). Mean percent fidelity ratings by strategy conditions were 
100.00 (SD = 0), 97.05 (SD = 4.69), and 97.36 (SD = 4.52) for 
verbal-only, verbal + visual, and visual-only, respectively. 


Tasks and Materials 


The battery of group and individually administered tasks is 
described below. Experimental tasks are described in more 
detail than published and standardized tasks. Tasks were di- 
vided into classification, pretest-only, and pretest/posttest mea- 
sures. The sample reliabilities (Cronbach alpha) for each mea- 
sure are shown in Table 1. 


Classification Measures 


Fluid intelligence. The Raven Colored Progressive Matrices 
(Raven, 1976) was administered to determine if all children were 
within the normal range on a measure of fluid intelligence. Chil- 
dren were presented patterns displayed on each page, with each 
pattern revealing a missing piece. For each pattern, six possible 
replacement pattern pieces were displayed. Children were required 
to circle the replacement piece that best completed the pattern. The 
dependent measure (raw score range 0 to 36) was the number of 
problems solved correctly, which yielded a standardized score 
(M = 100, SD = 15). 

Word problems. Two measures were administered to assess 
word problem solving ability. The word problem subtests from the 
Test of Math Ability (TOMA-2; Brown et al., 1994) and KeyMath 
(KEYM; Connolly, 1998) were administered. Subtests from these 
measures yielded a scale score (M = 10, SD = 3). 
Reading Skills


Several studies have found that working memory is unrelated to 
word problem solving accuracy when reading proficiency scores 
are entered in a regression analysis (Ng & Lee, 2009; Swanson et 
al., 1993). Thus, it was necessary to administer reading measures
at pretest because of their potential to moderate treatment outcomes.

Word recognition. Word recognition was assessed by the 
reading subtest of the WRAT-3 (Wilkinson, 1993). The task pro- 
vided a list of words of increasing difficulty. The child’s task was 
to read the words until 10 errors occurred. The dependent measure 
was the number of words read correctly. 

Reading comprehension. Reading comprehension was as- 
sessed by the Passage Comprehension subtest from the Test of 
Reading Comprehension (TORC-III; Brown, Hammill, & Weiderholt,
1995). The purpose of this task was to assess the child’s
comprehension of topic or subject meaning during reading activ- 
ities. Comprehension questions were drawn from the reading of 
short paragraphs. The dependent measure was the number of 
questions answered correctly. 

Arithmetic calculation. Because the focus in this study was 
on problem solving and not calculation per se, I tested whether the 
children with and without MD were in the normal range on 
calculation skills. The arithmetic subtest from the Wechsler Indi- 
vidual Achievement Test (WIAT; Psychological Corporation, 
1992) was individually administered. This task required the written 
computation of problems that increased in difficulty. Problems 
began with simple calculations (2 + 2 =) and worked up to more 
elaborate algebraic calculations. The dependent measure was the 
number of problems correct, which yielded a standard score (M =
100, SD = 15).


Working-Memory Capacity 


Several studies have shown that individual differences in working
memory span (referred to here as working memory capacity; 
WMC) play a major role in problem-solving performance (e.g., 
Swanson et al., 2008). Thus, I measured WMC to determine its 
effect on solution accuracy as a function of treatment conditions. 
The WMC tasks required children to hold increasingly complex 
information in memory while responding to a question about the 
task. The questions served as distractors to item recall because they 
reflected the recognition of targeted and closely related nontar- 
geted items. A question was asked for each set of items, and the 
tasks were discontinued if the question was answered incorrectly 
or if all items within a set could not be remembered. For this study, 
two WM tasks were administered (conceptual span and sentence/ 
digit task) that followed this format. A separate WM task, referred 
to as updating, was also administered. The WMC score was the 
composite z score that was formed by averaging across the three 
measures discussed below. 
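
A composite of this kind can be sketched as follows. The text states only that the WMC score is a composite z score averaged across the three measures; the sketch assumes each measure is standardized on the full sample using the sample standard deviation before averaging. Function names are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize raw scores using the sample mean and sample SD."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def wmc_composite(concept, sent_dig, update):
    """Average the three standardized WM measures, child by child.

    Each argument is a list of raw scores in the same child order.
    """
    standardized = [z_scores(concept), z_scores(sent_dig), z_scores(update)]
    return [mean(child_scores) for child_scores in zip(*standardized)]
```

Standardizing before averaging puts the three tasks on a common scale, so a task with a wider raw-score range (for example, updating, with a range of 0 to 16) does not dominate the composite.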

Conceptual span task. The purpose of this task was to assess 
the participant’s ability to organize sequences of words into ab- 
stract categories (Swanson, 1992, 1995). The participant was pre- 
sented a set of words (one every 2 seconds), asked a discrimination 
question, and then asked to recall the words that “go together.” For 
example, a set might include the following words: shirt, saw, 
pants, hammer, shoes, nails. Children were directed to retrieve the 
words that “go together” (i.e., shirt, pants, and shoes; saw, hammer,
and nails). The discrimination question was “Which word,
‘saw’ or ‘level,’ was said in the list of words?” Thus, the task
required participants to transform information encoded serially 
into categories during the retrieval phase. The range of set diffi- 
culty was two categories of two words to five categories of four 
words. The dependent measure was the highest set recalled cor- 
rectly (range of 0 to 8) in which the process question was answered 
correctly. 

Sentence/digit span. This task assesses the child’s ability to 
remember numerical information embedded in a short sentence 
(Swanson, 1992, 1995). Before stimulus presentation, the child 
was shown a card depicting four strategies for encoding numerical 
information to be recalled. The pictures portrayed the strategies of 
rehearsal, chunking, association, and elaboration. The experi- 
menter described each strategy to the child before the administra- 
tion of targeted items. After all strategies were explained, the child 
was presented numbers in a sentence context. For example, Item 3 
states, “Now suppose somebody wanted to have you take them to 
the supermarket at 8 6 5 1 Elm Street.” The numbers were 
presented at 2-s intervals, followed by a process question (i.e., 
“What was the name of the street?”). Then, the child was asked to 
select a strategy from an array of four strategies that represented 
the best approximation of how he or she planned to practice the 
information for recall. Finally, the examiner prompted the child to 
recall the numbers from the sentence in order. No further infor- 
mation about the strategies was provided. Children were allowed 
30 seconds to remember the information. Recall difficulty for this 
task ranged from 3 digits to 14 digits. The dependent measure was 
the highest set correctly recalled (range = 0 to 9) in which the
process question was answered correctly. 

Updating. Because WM tasks were assumed to tap a measure 
of controlled attention referred to as updating (e.g., Miyake, Fried- 
man, Emerson, Witzki, & Howerter, 2000), an experimental up- 
dating task, adapted from Morris and Jones (1990), was also 
administered. A series of one-digit numbers were presented that 
varied in set lengths of nine, seven, five, and three. No digit 
appeared twice in the same set. The examiner told the child that the 
length of each list of numbers might be three, five, seven, or nine 
digits. Children were then told that they should recall only the last 
three numbers presented. The digits were presented at approxi- 
mately 1-second intervals. After the last digit was presented the 
participant was asked to name the last three digits in order. In 
contrast to the aforementioned WM measures, which involved a 
dual-task situation where children answered questions about the 
task while retaining information (words or spatial location of dots), 
the current task involved the active manipulation of information 
such that the order of new information was added to or replaced 
the order of old information. That is, to recall the last three digits 
in an unknown (N = 3, 5, 7, 9) series of digits, the children must 
keep the order of old information available (previously presented 
digits) along with the order of newly presented digits. The depen- 
dent measure was the total number of sets correctly repeated 
(range 0 to 16). 


Pretest and Posttest Measures 


Word problem solving accuracy (CMAT). Because children 
were classified as at risk for MD on the TOMA and KeyMath, a 
separate norm-referenced measure of word problem solving accuracy
was individually administered at pretest and posttest: the
Story Problem subtest from the Comprehensive Mathematical 
Abilities Test (CMAT; Hresko, Schlieve, Herron, Swain, & Sher- 
benou, 2003). The technical manual for this subtest reported ade- 
quate reliabilities (>.86) and moderate correlations (>.50) with 
other math standardized tests (e.g., the Stanford Diagnostic Math- 
ematics Test). The test included story problems that increased in 
solution difficulty. Two forms of the measures varied only in 
names and numbers. The two forms were counterbalanced across 
presentation order. 


Transfer 


I was interested in how well treatment effects would transfer to 
other tasks besides the problem solving measure (CMAT). I se- 
lected two tasks that I assumed tapped into near transfer and far 
transfer. The near transfer task (defined as tasks that matched the 
focus of intervention) focused on calculation. Because the children
in each training session received practice in calculation accuracy,
a skill closely aligned with the intervention, I expected some
increases in computation accuracy. For the far transfer measure (task not
directly related to the focus of treatment), I assessed improvements 
in working memory on the operation span measure. The measures 
involved holding in working memory both verbal and compu- 
tation information of increasing difficulty. Therefore, I assumed 
that because the cognitive strategy instruction in this study inte- 
grated verbal and calculation information, some transfer on the 
aforementioned measures might occur. 

Calculation. The arithmetic subtest from the WRAT-3 (Wilkin- 
son, 1993) was individually administered. Two forms of the tests 
were counterbalanced across children at pretest and posttest. The 
subtests required written computation to problems that increased in 
difficulty. 

Operation span. A version of the Turley-Ames and Whitfield 
(2003) operation span task, modified for children (Swanson, 
Kehler, & Jerman, 2010), was individually administered at pretest 
and posttest. Two identical forms were created and counterbal- 
anced for presentation order. The operation span test assessed WM 
span by having children solve simple math problems (e.g., 2 +
3 =, 4 − 1 =) while also remembering unrelated to-be-
remembered words (e.g., car, pencil) that followed each math
problem. Operation-word sequences increased in set size. Children 
completed two practice trials with a set size of two. Children were 
then presented with operation-word sequences in sets of two, three, 
four, and five, with two trials for each set size for a total of 10 sets. 
Two versions of test stimuli (form A and form B) were counter- 
balanced for presentation order. Children received points toward 
their span score for correctly solving the math problems, for the 
number of correctly recalled words, and for correct order of word 
recall. 


Statistical Analyses 


Children were drawn from 22 third-grade classrooms. Because 
the data reflected treatments for children within classrooms, a 
mixed analysis of covariance (ANCOVA) model was necessary to 
analyze treatment effects. The fixed and random effect parameter 
estimates were obtained using PROC MIXED in SAS 9.3. The 
primary model used in this study was a 2 (MD vs. average
achievers) × 4 (treatment) mixed ANCOVA. The covariates for
the mixed ANCOVA were the continuous variables of pretest and 
working memory capacity. 

All four treatments were administered within each of the 22 
classrooms. Within each classroom, children were randomly assigned
to treatments. Tutors (N = 17) were crossed across classrooms
and treatments. That is, except in the control condition, all
tutors took turns administering each treatment condition within 
each classroom. This rotation of tutors was done to ensure that 
posttest outcomes were related to the treatment procedures rather 
than the tutors assigned to administer the treatments. The formula 
for the cross-classification intercept-only model was as follows 
(see Hox, 2010, Chapter 9, for a review): 


$$Y_{i(jk)} = \beta_{0(jk)} + \varepsilon_{i(jk)}, \qquad \varepsilon_{i(jk)} \sim N(0, \sigma^2) \tag{1}$$


where $Y_{i(jk)}$, problem-solving accuracy at the posttest for pupil $i$
within the cross-classification of tutor ($j$), classroom ($k$), and
treatments, was modeled by the intercept (the overall mean) $\beta_{0(jk)}$
for the specific combination of tutor and classroom and a residual
error term $\varepsilon_{i(jk)}$. The subscripts ($jk$) are written within parentheses
to indicate they are conceptually at the same level: the $jk$th tutor/
classroom combination in the cross-classification of tutor and
classroom. The subscripts ($jk$) indicated that the intercept $\beta_{0(jk)}$
varied independently across both tutor and classroom. The error
term, $\varepsilon_{i(jk)}$, reflects the deviation of the child’s $Y_{i(jk)}$ score from the
cell mean. These deviations were assumed to be normally distributed
with mean 0 and a within-cell variance $\sigma^2$.

Thus, I modeled the intercepts with the second-level equation:


$$\beta_{0(jk)} = \gamma_{00} + v_{0j} + v_{0k} \tag{2}$$


In Equation 2, $v_{0j}$ is the residual term for tutors, and $v_{0k}$ is the
residual term for the classroom. After substitution, this produces
the intercept-only model as


$$Y_{i(jk)} = \gamma_{00} + v_{0j} + v_{0k} + \varepsilon_{i(jk)} \tag{3}$$


I added to this model (Equation 3) the categorical variables of
treatment and math ability, the continuous variables of
pretest and WMC scores as covariates, and the interactions:


$$\begin{aligned}
Y_{i(jk)} = {} & \gamma_{00} + \gamma_{10}(\text{treatment}) + \gamma_{20}(\text{math ability}) + \gamma_{30}(\text{pretest}) \\
& + \gamma_{40}(\text{WMC}) + \gamma_{50}(\text{WMC} \times \text{treatment}) \\
& + \cdots \ (\text{two-way and three-way interactions}) \\
& + v_{0j} + v_{0k} + \varepsilon_{i(jk)}
\end{aligned} \tag{4}$$


The intraclass correlations when predicting posttest scores with
only random effects (tutor and classroom × tutor) were as follows:
$\rho = .15$ ($\tau_0^2 = 0$, $\tau_1^2 = .17$, $\sigma^2 = .92$) for problem-solving
accuracy, $\rho = .35$ ($\tau_0^2 = .001$, $\tau_1^2 = .55$, $\sigma^2 = .98$) for calculation
accuracy, and $\rho = 0$ ($\tau_0^2 = 0$, $\tau_1^2$ = negligible, $\sigma^2 = .93$) for
operation span. The intraclass correlations for predicting posttest
scores were reduced to 0 for problem-solving accuracy, .15 for 
calculation accuracy, and 0 for operation span when treatment 
conditions, ability group, WMC, pretest scores, and interactions 
were entered into the full model. 
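
The reported intraclass correlations can be approximately reproduced from the variance components. The sketch below assumes the intraclass correlation is the share of total variance attributable to the two random effects, that is, the sum of the two random-effect variances divided by that sum plus the residual variance. This formula is an assumption that happens to match the reported values to within rounding; the text does not state it. The function name is hypothetical.

```python
def intraclass_correlation(tau0_sq, tau1_sq, sigma_sq):
    """Share of total variance due to the random effects.

    tau0_sq, tau1_sq: variances of the two random effects.
    sigma_sq: residual (within-cell) variance.
    """
    random_variance = tau0_sq + tau1_sq
    return random_variance / (random_variance + sigma_sq)


# Variance components as reported for the random-effects-only models.
problem_solving = intraclass_correlation(0.0, 0.17, 0.92)
calculation = intraclass_correlation(0.001, 0.55, 0.98)
# The second variance for operation span is reported as negligible;
# it is treated as 0 here.
operation_span = intraclass_correlation(0.0, 0.0, 0.93)
```

Under this assumption the values come out close to the reported .15, .35, and 0; small discrepancies in the second decimal are expected because the published variance components are themselves rounded.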

Because the cells are unbalanced, a Kenward-Roger correction
was used to obtain degrees of freedom. A full maximum-likelihood
(ML) estimation was used to compute the parameters at posttest
because of some attrition in sample size (Widaman, 2006). 


Results 


Table 1 provides the means and standard deviations for the
classification and criterion (pretest) measures. The F ratios 
comparing the ability groups prior to treatment assignment and 
the sample reliability of the measures are also reported in Table 1. 
Sample sizes, pretest scores, and posttest scores for children with 
and without MD as a function of treatment conditions are reported 
in Table 2. 


Pretest 


Criterion measures. Because children were randomly as- 
signed to each condition within classrooms, it was necessary to 
determine if potential biases in treatment assignment emerged at 
pretest. The criterion measures used to assess treatment effects 
were word problems from the CMAT, arithmetic problems from 
the WRAT-3, and recall scores from the operation span measure. 
Equivalent forms were developed for each measure, and the pre- 
sentation orders were counterbalanced across treatment conditions. 

A 2 (MD vs. average achievers) × 4 (treatment condition)
mixed analysis of variance (ANOVA) was computed on pretest 
scores. The random effects included variance related to the afore- 
mentioned intercepts for tutors and classroom assignment. As 
expected, the main effect was significantly in favor of average 
achieving children when compared to children with MD on mea- 
sures of pretest problem-solving accuracy, F(1, 147) = 129.44, 
p < .001; pretest calculation accuracy, F(1, 147) = 18.89, p < 
.001; and pretest operation span accuracy performance, F(1, 
143) = 4.80, p = .03. The main effects for treatment conditions 
were not significant, however, for pretest problem solving, F(3, 
147) = 0.59, p = .62; pretest calculation, F(3, 147) = 0.75, p = 
.52; or pretest operation span performance, F(3, 143) = 1.46, p = 
.33. In addition, no significant effects occurred for the ability 
group × treatment interactions on measures of pretest problem
solving, F(3, 147) = 2.07, p = .11; pretest calculation, F(3, 
147) = 0.75, p = .52; or pretest operation span performance, F(3, 
141) = 2.06, p = .11. A mixed 2 (ability group) × 4 (treatment)
ANOVA was also computed on the WMC composite scores. A 
significant effect was found for ability group, F(1, 147) = 161.08, 
p < .001, but not for the main effect of treatment, F(3, 147) = 
0.92, p = .43, or for the ability group × treatment interaction, F(3,
147) = 0.31, p = .82. 

Classification. The two ability groups were compared across 
treatment conditions on the classification measures. A 2 (ability 
group = MD vs. average achievers) × 4 (treatment) multivariate
analysis of variance (MANOVA) was computed on classification 
measures of problem solving (TOMA, KeyMath), reading (TORC), 
fluid intelligence (Raven Colored Matrices Test), and arithmetic 
calculation (WIAT). The MANOVA was significant for ability 
group, Wilks’s Λ = .48, F(4, 114) = 19.00, p < .001, but not for
treatment, Wilks’s Λ = .74, F(12, 301) = 1.27, p = .23, or for the
ability group × treatment interaction, Wilks’s Λ = .79, F(12,
301) = 1.17, p = .30. As expected, all univariate tests of significance
were statistically significant in favor of the average achievers
when compared to children with MD (see Table 1; all ps <
.05). It is important to note, however, that although fluid intelligence,
reading, and calculation scores were in the normal range
for children with MD, average achieving children yielded 
higher scores than children with MD on these measures (see 
Table 1). 

For the next series of analyses, posttest criterion measures were 
converted to z scores based on the total sample means and standard 
deviations at pretest. This conversion allowed for comparisons 
across various dependent measures, as well as the identification of 
outliers (absolute z score > 3.5). No outliers were identified. 
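
The conversion and outlier screen described above can be sketched as follows. This is a minimal illustration in Python, not the analysis code used in the study: the function names and the example scores are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption because the text does not specify it.

```python
from statistics import mean, stdev

def to_z(scores, ref_mean, ref_sd):
    """Standardize scores against a reference mean and SD (here, pretest)."""
    return [(s - ref_mean) / ref_sd for s in scores]

def flag_outliers(z_scores, cutoff=3.5):
    """Return indices whose absolute z score exceeds the cutoff."""
    return [i for i, z in enumerate(z_scores) if abs(z) > cutoff]

# Hypothetical raw scores, for illustration only.
pretest = [4, 6, 5, 7, 3, 5, 6, 4]
posttest = [6, 7, 5, 9, 4, 6, 8, 5]

# Posttest scores are scaled with the pretest mean and SD of the total sample.
# stdev() is the sample SD (n - 1 denominator); this choice is an assumption.
ref_mean, ref_sd = mean(pretest), stdev(pretest)
post_z = to_z(posttest, ref_mean, ref_sd)
outliers = flag_outliers(post_z)
```

Scaling posttest scores by pretest parameters keeps gains visible: a posttest z score above 0 means the score exceeds the pretest mean of the total sample.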


Posttest CMAT Solution Accuracy 


A 2 (ability group) × 4 (treatment) mixed ANCOVA was 
computed on posttest z scores. The covariates for the analyses were 
the continuous variables of pretest CMAT solution accuracy and 
WMC. The mixed ANCOVA yielded significant effects for the 
ability group × treatment interaction, F(1, 147) = 3.21, p = .02; 
the WMC × treatment interaction, F(3, 147) = 8.10, p < .001; the 
ability group × treatment × WMC interaction, F(3, 147) = 3.51, 
p = .02; and the pretest CMAT score, F(1, 147) = 205.71, p < 
.001. No other significant effects occurred (ps > .05). For example, 
no significant main effects occurred for ability group, F(1, 147) = 
1.42, p = .23, or treatment condition, F(3, 147) = 0.74, p = .53. 
The adjusted posttest z score means for each treatment condition as 
a function of ability group are shown in Table 3. 

Because of the significant ability group × treatment × WMC 
interaction, a series of follow-up tests was conducted. The first 
follow-up determined if the slopes varied significantly between 
treatments as a function of ability group. The slopes for each 
treatment as a function of ability group are shown in Table 3. As 
shown, slopes were significantly higher for average achieving 
children when compared to children with MD within the verbal + 
visual condition, t(147) = −1.96, p = .05, and the control 
condition, t(147) = −2.57, p = .01. However, slopes were significantly 
higher for children with MD when compared to average 
achieving children within the visual-only condition, t(147) = 2.40, 
p = .02. No significant ability group differences in slopes occurred 
within the verbal-only condition, t(147) = 1.01, p = .31. The 
slopes for the control condition were next compared to the other 
treatment conditions within each ability group. When compared to 
those for children with MD in the control condition, the slopes 
were significantly larger for children with MD within the verbal- 
only condition, t(147) = 2.48, p = .01, and visual-only condition, 
t(147) = 3.59, p < .001, but not when compared to children with 
MD within the verbal + visual condition, t(147) = .14, p = .88. 
When compared to those for average achieving children in the 
control condition, the slopes were significantly larger for average 
achieving children within the verbal-only condition, t(147) = 2.64, 
p = .009; verbal + visual condition, t(147) = 2.24, p = .03; and 
the visual-only condition, t(147) = −3.66, p = .008. 
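
As a rough check on how such slope comparisons work, the sketch below computes a Wald-type statistic for the difference between two slopes from their estimates and standard errors. It assumes the two estimates are independent and uses the visual-only problem solving slopes listed in Table 3 (0.96, SE = 0.28 for children with MD; 0.22, SE = 0.13 for average achievers). The function name is hypothetical, and the reported tests came from the mixed model rather than from this shortcut.

```python
import math

def slope_difference_t(b1, se1, b2, se2):
    """Wald-type statistic for b1 - b2, assuming independent estimates."""
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Visual-only condition, problem solving slopes from Table 3.
t_visual = slope_difference_t(0.96, 0.28, 0.22, 0.13)
print(round(t_visual, 2))  # 2.4, close to the reported t(147) = 2.40
```

The statistic is positive when the first slope is larger, so the sign depends only on the order in which the two groups or conditions are entered.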

Figures 1a and 1b show the linear regression line for each 
treatment condition as a function of WMC on the adjusted posttest 
solution accuracy scores for children with and without MD, re- 
spectively. As shown, posttest scores as a function of treatment 
(i.e., verbal-only and visual-only treatment conditions) were clearly 
divergent from the control condition, as WMC z scores approached 
approximately —1.0 for children with MD (see Figure 1a), 
whereas clear treatment divergence from the control condition 
Table 2 


Means and Standard Deviations as a Function of Treatment and Ability Groups 


Verbal-only Verbal + visual Visual-only Control 


Variable N M SD N M SD N M SD N M SD 


Children with MD 





Age 12 8.79 0.79 17 8.86 0.70 14 8.55 0.52 16 8.92 0.93 
Classification 

TOMA-S 12 5.92 1.16 16 6.06 1.29 14 6.21 0.89 16 6.44 0.89 

KeyM-S i 7.00 0.89 17 6.44 0.73 14 6.25 2.05 16 7.60 0.89 

Math 12 6.42 1.38 Vy 6.47 1.12 14 6.79 0.8 16 6.88 1.09 
Fluid intelligence : 

Raven-S 10 95.97 19.75 14 100.58 12.13 3 97.27 11.41 14 96.85 ORS) 
Reading 

TORC-S 10 10.70 TT 17 9.00 1.80 12 9.50 2.81 15 9/33 2.26 

WRAT-S 12 100.00 7.69 17 98.18 12.53 13 98.77 9.96 16 97.69 8.96 
Arithmetic 

WIAT-S 12 98.00 11.88 17 92.65 12.45 13 95.08 8.39 16 94.88 9.63 
Working memory 

Concept_R 12 2.00 1.28 17 Berl 2.64 13 3.00 es 16 OS) 2.24 

Sent/Dig_R 12 5.08 3.42 17 5.47 3.26 iB 4.54 BES) 16 4.06 Bag 

Update_R 12 4.58 3.42 17 3:39 1.73 14 3.00 1.92 16 3.94 ET 

WMC a —0.47 Ome 7 —0.40 0.30 14 —0.60 0.45 16 a () 0.31 
Pretest 

Problem solving 12 5.00 1.81 Wi 4.65 1.93 14 4.50 2.62 16 6.00 2.45 

Calculation_R 12 24.50 3.21 17 23.47 3.14 14 23.14 Leg 16 25.00 218 

Operation span_R 12 1.92 1.44 16 4.38 3.76 14 4.36 3.95 16 4.31 3.38 
Posttest 

Problem solving iP 6.33 2.10 7] 6.47 2.83 13 5.62 Ne) 16 ees) D9) 

Calculation_R 12 26.00 2.70 7) 25.65 2.50 13 24.46 3.36 16 26.44 3:29 

Operation span_R 12 3.83 3.01 17 5 3.08 14 6.36 4.01 16 SS) 2.86 

Average achievers 

Age 25 8.77 .60 18 8.88 50 22 8.66 36 22 8.83 48 
Classification 

TOMA-S 25 9.88 DEST 18 9.56 1.69 22 9.32 1.96 22 9.82 1.87 

KeyM-S 25 10.77 DANG 18 9.67 1.86 22 12.07 1.98 Pp) 10.15 2.23 

Math_S 25 11.40 1.41 18 10.28 23 22 11.91 1.69 22 10.86 SD 
Fluid intelligence 

Raven-S 24 110.91 9132 16 110.05 8.95 22 107.83 10.56 21 101.5 13193 
Reading 

TORC-S 24 11.92 2.00 17 11.18 1.70 21 11.57 1.63 21 eS 1.91 

WRAT-S DS 108.84 11.06 18 110.94 10.26 22 112.32 11.81 DD) 110.23 13.88 
Arithmetic 

WIAT-S 25 104.60 53 17 102.53 8.25 22 107.05 6.83 22 103.05 11.49 
Working memory : 

Concept_R 25 5.60 3.99 18 6.89 3.14 a2 9.86 7.63 22 7.86 5.95 

Sent/Dig_R 25 10.08 5.69 18 10.67 5.65 22 eee 4.7 i, 8.68 aa2, 

Update_R 2S 9.48 4.81 18 10.44 3.81 D2, 8.23 4.0 22 7.91 4.12 

WMC* Des, 0.54 0.62 18 0.77 0.57 22. 0.63 0.77 22 0.52 0.51 
Pretest 

Problem solving R 25 10.48 DBS 18 9.50 2.81 , 9°73 2.43 22 8.91 2291 

Calculation_R 25 27.08 4.05 18 26.11 4.74 22 26.59 Sell 22 25.95 3.20 

Operation span_R 25 5.64 4.69 18 6.72 5.85 22 B95 Bw 22 4.50 4.10 
Posttest 

Problem solving_R 25 10.4 2.90 18 10.83 2A) 20 10.5 IRO3 19 9.68 2.38 

Calculation_R 25 29) 3.21 18 28.44 4.2 20 DTS 292, 20 299 2.88 

Operation span_R DS 7.32 4.23 18 8.00 5.64 22 7.00 3.70 22 6.09 4.03 


Note. _R at the end refers to raw score; _S at the end refers to standard or scale score. TOMA = Test of Math Ability; KeyM = KeyMath test; Math_S = 
mean scale-score (TOMA, KeyM); Raven = Raven Colored Matrices Test; TORC = Test of Reading Comprehension; WRAT = Wide Range Achievement 
Test; WIAT = Wechsler Individual Achievement Test; Concept = conceptual span; Sent/Dig = sentence/digit span; Update = updating measure; WMC = 
working memory capacity; Problem solving = word problems solving subtest from the Comprehensive Test of Math Abilities (CMAT); Calculation = 
arithmetic calculation subtest from the WRAT-3. 

a Denotes composite mean z score of working memory span measures (conceptual span, digit/sentence span, and updating). 
Table 3 

Estimated Adjusted Posttest Z Scores and Slopes for Problem Solving, Calculation, and 
Operation Span as a Function of Treatment and Ability Group 

              Problem solving      Calculation          Operation span 
Group         Adjusted M   SE      Adjusted M   SE      Adjusted M   SE 
Posttest accuracy 

Verbal-only 

MD 0.99 0.29 1.28 0.42 0.54 0.37 

AVE 0.67 0.11 1.4 0.16 0.53 0.13 
Verbal + visual 

MD 0.26 0.24 1.46 0.34 0.28 0.31 

AVE 0.78 0.16 0.97 0.23 0.49 0.19 
Visual-only 

MD 1.04 0.24 1.94 0.35 1.09 0.31 

AVE 0.72 0.11 ! 0.17 0.89 0.14 
Control 

MD 0.16 0.27 0.88 0.40 0.05 0.35 

AVE 0.91 0.12 Py 0.17 0.16 0.14 

Slopes 

Verbal-only 

MD 0.69 0.42 0.38 0.59 0.22 0.52 

AVE 0.25 OMS 0.2 0.21 0.04 0.19 
Verbal + visual 

MD =0)0il 0.37 0.41 0.54 0.05 0.51 

AVE 0.21 0.19 0.78 0.28 0.01 0.24 
Visual-only 

MD 0.96 0.28 1.4 0.39 0.7 0.35 

AVE 0.22 0.13 0.16 0.18 anes) 0.16 
Control 

MD —0.68 0.37 —0.44 0.53 Oe 0.47 

AVE 38 0.19 S28 0.27 0.74 0.24 


Note. MD = children with math difficulties; AVE = children without math difficulties, or average achieving 
children. 


occurred for average achieving children when the WMC z score 
was at approximately 1.0 (see Figure 1b). Figures 1a and 1b also 
show a point at which treatment and control conditions intersect. 
As shown in Figure 1a, this intersection point occurred at a WMC z 
score of approximately −.5 for children with MD, whereas, as shown 
in Figure 1b, it occurred at a z score of approximately .5 for 
average achieving children. Thus, as a 
follow-up to the covariate (WMC) by treatment interaction, post- 
test scores were estimated when made conditional on setting WMC 
to high (1.0 z score), middle (0 z score), and low (—1.0 z score) 
values.* Pretest CMAT scores again served as a covariate in the 
analysis. The estimated adjusted mean posttest scores when made 
conditional on setting WMC to high, middle, and low values are 
reported in Table 4. 
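
The logic of the intersection point and of conditioning on fixed WMC values can be illustrated with simple straight lines. The sketch below treats the Table 3 adjusted mean as the predicted score at a WMC z score of 0 and the Table 3 slope as the change per unit of WMC, which is a simplifying assumption; the values in Table 4 come from the fitted mixed model and differ somewhat from these straight-line predictions. The function names are hypothetical.

```python
def predict(intercept, slope, wmc):
    """Predicted posttest z score on a straight line at a given WMC z score."""
    return intercept + slope * wmc

def crossing_point(a1, b1, a2, b2):
    """WMC value at which two lines with different slopes intersect."""
    return (a2 - a1) / (b1 - b2)

# Children with MD, problem solving (Table 3): (adjusted M, slope).
visual_only = (1.04, 0.96)
control = (0.16, -0.68)

# Pick-a-point: evaluate each line at low, middle, and high WMC.
for wmc in (-1.0, 0.0, 1.0):
    gap = predict(*visual_only, wmc) - predict(*control, wmc)
    print(f"WMC = {wmc:+.1f}: visual-only minus control = {gap:+.2f}")

cross = crossing_point(*visual_only, *control)
print(f"lines cross near WMC = {cross:.2f}")
```

Under these assumptions the two lines cross at a WMC z score of about −0.54, in line with the approximate value of −.5 read from Figure 1a, and the sign of the treatment-minus-control gap reverses on either side of that point.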

At the high WMC level (1.0 WMC values), significant treatment 
effects occurred at posttest for children with MD, F(3, 147) = 4.99, 
p = .003, but not for average achieving children, F(3, 147) = 1.85, 
p = .14. For children with MD, a Tukey test indicated a significant 
advantage at posttest for the visual-only and verbal-only condition 
compared to the other treatment conditions (visual-only = verbal- 
only > verbal + visual = control). A significant posttest advantage 
was found for average achieving children when compared to children 
with MD within the verbal + visual condition, F(1, 147) = 4.72, p < 
.04. In contrast, children with MD outperformed average achieving 
children at posttest within the visual-only condition, F(1, 147) = 4.05, 
p < .045. No other significant effects (ps > .05) occurred in estimated 


posttest scores when made conditional on setting WMC to a high (1.0 
z score) level. 

At the middle WMC level (0 z score), no significant treatment 
effects occurred for children with MD, F(3, 147) = 2.39, p = .07, 
or for average achieving children, F(3, 147) = 1.23, p = .29. The 
only significant ability group difference in posttest performance 


*I did not compare treatment differences at each point of the WMC 
covariate. For parsimony, I selected a cutoff point (referred to as a 
pick-a-point approach; Rogosa, 1980) for comparisons at the lines where 
treatment outcomes started to diverge in Figure 1a and Figure 1b. Not unlike 
when using the Johnson-Neyman procedure (Rogosa, 1980), I did consider 
potential regions of significance. I initially utilized Bauer and Curran’s 
(2005) procedure (see Lazer & Zerbe, 2011, for SAS syntax) for computing 
the Johnson-Neyman technique for mixed models. However, because there 
was sufficient information to conclude the slopes were not equal across all 
conditions and my hypotheses were tied to variations in outcomes related 
to high and low WMC, I tested whether the adjusted posttest treatment 
outcomes depended on whether WMC was set at high, middle, and low 
values. This procedure, referred to as a treatment by covariate interaction 
design (e.g., Judd, McClelland, & Smith, 1996; Leon, Portera, Lowell, & 
Rheinheimer, 1998; Littell, Milliken, Stroup, Wolfinger, & Schabenberger, 
2010), has the advantage of testing whether the adjusted posttest scores as 
a function of treatment are conditional on the level at which WMC is set. 
The SAS syntax for computing the estimated adjusted means (least square 
means) at posttest conditional on setting the covariate to specific values is 
provided in Littell et al. (2010, p. 265). 
Panel a. Regression Line for WMC and Strategy Training (Group = Children with MD) 
Panel b. Regression Line for WMC and Strategy Training (Group = Average Achiever) 
Figure 1. a: Linear regression slope for Treatment × Working Memory 
Capacity (WMC) for children with math difficulties. b: Linear regression 
slope for Treatment × Working Memory Capacity for average achievers. 
MD = math difficulties. 


that occurred was within the control condition. Average achieving 
children yielded significantly higher adjusted posttest scores than 
did children with MD, F(1, 147) = 7.24, p = .008. 


At the low WMC level (−1.0 z score), a significant treatment 
effect occurred for children with MD, F(3, 147) = 7.12, p < .001, 
but not for average achieving children, F(3, 147) = 2.40, p = .07. 
For children with MD, a Tukey test indicated a posttest advantage 
(ps < .05) for the control and verbal + visual conditions when 
compared to the other conditions (control = verbal + visual > 
verbal-only = visual-only). No significant ability group differences 
in posttest scores occurred within treatment 
conditions (ps > .05). 

Summary. The results clearly showed that WMC moderated 
treatment outcomes on measures of posttest solution accuracy. For 
children with MD with relatively high WMC, an estimated posttest 
treatment advantage was found for the verbal-only and visual-only 
conditions when compared to the control condition. In contrast, 
when WMC was set to a low level, none of the treatment condi- 
tions exceeded the control condition at posttest. 

For average achievers, no treatment advantages relative to the 
control condition occurred at posttest. Although the results showed 
higher slopes for the treatment conditions when compared to the 
control condition, none of the treatment conditions significantly 
improved posttest scores when compared to the control condition. 
These nonsignificant effects held regardless of whether WMC was 
set to high, middle, or low WMC values. 

Taken together, there is weak support for the compensatory 
hypothesis, which suggests that children with lower WMC are 
more likely to benefit from strategy conditions when compared to 
the control condition than are those with higher WMC values. 


Posttest Calculation Accuracy 


Because calculation practice was part of the intervention, I 
expected some improvements in arithmetic skills. The general 
analytic strategy as used before tested whether there were im- 
provements in calculation accuracy as a function of ability group 
and treatment conditions. A 2 (ability group) × 4 (treatment) 
mixed ANCOVA was computed on posttest calculation (WRAT-3) z 
scores. The covariates were the continuous variables of pretest 
calculation accuracy and WMC. The mixed ANCOVA yielded 
significant effects for the ability group × treatment interaction, 
F(3, 144) = 5.24, p = .002; WMC × treatment interaction, F(3, 
141) = 3.32, p = .02; WMC, F(1, 145) = 4.58, p = .03; and the 
pretest WRAT-3 score, F(1, 141) = 164.64, p < .001. No other 
significant effects occurred (ps > .05). In particular, no significant 
main effects occurred for ability group, F(1, 147) = .31, p = .58; 
treatment conditions, F(3, 124) = 1.18, p = .32; or the ability 
group × treatment × WMC interaction, F(3, 129) = 2.20, p = .09. 
The adjusted posttest z score means for calculation accuracy are 
shown in Table 3. 

A test of simple effects as a follow-up to the ability group × 
treatment interaction indicated that treatment effects in adjusted 
posttest scores were significant for average achieving children, 
F(3, 145) = 8.27, p < .0001, but not for children with MD, F(3, 
145) = 1.37, p = .25. For average achieving children, a Tukey test 
yielded a significant (ps < .05) adjusted posttest score advantage 
for the control condition when compared to the treatment condi- 
tions (control > visual-only = verbal-only > verbal + visual). 
Within treatment conditions, a significant advantage in adjusted 
posttest scores was found for average achieving children when 
compared to children with MD for the control condition, (147) = 
Table 4 

Estimated Posttest Z Scores for Problem Solving, Calculation, and Operation Span as a 
Function of Treatment and Working Memory Capacity Set to High, Middle, and Low Values 

                     Problem solving      Calculation          Operation span 
Group                Adjusted M   SE      Adjusted M   SE      Adjusted M   SE 
High WMC 
  Verbal-only 
    MD               1.56         0.62    1.6          0.89    0.72         0.78 
    AVE              0.88         0.12    1.56         0.17    0.56         0.14 
  Verbal + visual 
    MD               SAIS)        0.54    1.81         0.77    0.32         0.72 
    AVE              0.96         0.12    1.62         0.17    0.51         0.15 
  Visual-only 
    MD               1.85         0.46    .            0.65    1.66         0.58 
    AVE              0.91         0.11    1.39         0.17    0.68         0.13 
  Control 
    MD               −0.41        0.57    0.51         0.83    −0.38        0.72 
    AVE              0.59         0.13    OT           0.19    0.78         0.16 
Middle WMC 
  Verbal-only 
    MD               0.87         0.28    E22          0.34    0.5          0.3 
    AVE              0.63         0.12    1.37         0.18    0.52         0.15 
  Verbal + visual 
    MD               0.36         0.19    1.4          0.27    0.27         0.24 
    AVE              0.74         0.18    0.84         0.27    0.49         0.22 
  Visual-only 
    MD               0.89         0.2     i 7Ak        0.3     0.97         0.26 
    AVE              0.69         0.13    1.23         0.18    0.93         0.16 
  Control 
    MD               0.27         0.22    0.95         0.33    0.14         0.28 
    AVE              0.97         0.14    2.24         0.2     0.03         0.17 
Low WMC 
  Verbal-only 
    MD               0.18         0.26    0.84         0.36    0.28         0.32 
    AVE              0.39         0.24    ly           0.34    0.48         0.31 
  Verbal + visual 
    MD               0.97         0.25    0.98         0.37    0.22         0.35 
    AVE              0.53         0.35    0.06         0.52    0.48         0.44 
  Visual-only 
    MD               −0.08        0.18    0.31         0.24    0.27         0.2 
    AVE              0.46         0.22    1.07         0.32    1.18         0.28 
  Control 
    MD               0.96         0.21    139)         0.3     0.66         0.26 
    AVE              35           0.3     22           0.44    = Osi        0.38 

Note. WMC = working memory capacity; MD = children with math difficulties; AVE = children without 
math difficulties, or average achieving children; High WMC = estimated adjusted posttest score by setting to 
a WMC value of 1.0; Middle WMC = estimated adjusted posttest score by setting to a WMC z-score value of 
0; Low WMC = estimated adjusted posttest score by setting to a WMC value of −1.0. 


9.31, p = .003. No other significant effects (all ps > .05) emerged 
comparing the adjusted posttest scores. 

As shown in Table 3, when slopes were compared between 
ability groups, a larger slope was found for children with MD 
when compared to average achieving children within the visual-only 
condition, t(128) = 2.91, p < .004. No other significant 
differences (ps > .05) in slopes occurred between ability groups 
within conditions. Because no significant ability group × treatment 
× WMC interaction occurred, slopes were collapsed across 
ability groups for the comparison between treatment conditions. 
When compared to those for the control condition, slopes were 
significantly higher for the verbal + visual condition, t(147) = 
2.22, p = .02, and visual-only condition, t(146) = 3.07, p = .003. 
No significant difference in slopes occurred comparing the control 
condition to the verbal-only condition, t(143) = 1.49, p = .14. 


As a follow-up to the significant WMC × treatment interaction, 
I again set WMC to high (1.0), middle (0), and low (−1.0) values. 
The estimated posttest scores as a function of WMC values are 
shown in Table 4. These values were again selected so comparisons 
in posttest outcomes could be made across the three dependent 
measures. Because no significant group × treatment × WMC 
interaction emerged, comparisons between treatments focused on 
the total sample as a function of WMC values. No significant 
treatment differences (all ps > .05) occurred in the estimated 
adjusted posttest scores as a function of treatment conditions at 
high and middle WMC values (z scores of 1.0 and 0). In contrast, 
when WMC was set to low values (—1.0 z score), children in the 
control condition performed better than those in the treatment 
conditions. A Tukey test (p < .05) yielded a significant advantage 
in adjusted posttest scores for the control when compared to the 
treatment conditions (control > verbal-only = verbal + visual = 
visual-only). 

Summary. No support was found for the notion that, rela- 
tive to the control condition, strategy conditions provided ad- 
ditional advantages in posttest calculation performance for chil- 
dren with MD. Likewise, no posttest advantages as a function of 
strategy conditions were found for average achievers relative to 
the control condition. Regardless of ability group, when WMC 
was set to high and middle levels, no treatment advantages 
occurred relative to the untreated control condition. In fact, 
when WMC was set to a low level, children in the control 
condition actually performed better at posttest than did those in 
the treatment conditions. 


Posttest Operation Span 


Because strategy interventions included practice with word 
problems that gradually increased interference or distraction dur- 
ing the training sessions (the number of irrelevant sentences across 
sessions was gradually increased), I expected that this activity, 
coupled with the strategy instruction, would play an important role in 
treatment outcomes on working memory measures. Because WMC 
is defined as including the inhibition of distracting information 
(e.g., Engle et al., 1999), I tested whether some transfer effects 
occurred on the operation span measure. A 2 (ability group) × 4 
(treatment) mixed ANCOVA was computed on operation span 
posttest z scores. The covariates were pretest operation span and 
WMC. Both of these covariates were continuous variables. The 
mixed ANCOVA yielded significant effects for treatment, F(3, 
146) = 5.16, p = .002; for the ability group × treatment × WMC 
interaction, F(3, 146) = 3.77, p = .01; and for the pretest operation 
span score, F(1, 146) = 224.82, p < .0001. No other significant 
effects occurred (all ps > .05). The adjusted posttest means as 
well as slopes as a function of group and treatment are shown in 
Table 3. 

As in the previous analyses, a comparison was made between 
slopes within treatment conditions as a function of ability group. 
As shown in Table 3, slopes were significantly higher for children 
with MD than for average achieving children within the visual-only 
condition, t(147) = 2.46, p = .01. No other significant group 
differences in slopes occurred within the remaining 
treatment conditions (all ps > .05). I next compared the slopes of 
the control condition with those of the other treatment conditions 
within each ability group. When compared to those for children 
with MD in the control condition, the slopes were significantly 
larger for children with MD in the visual-only condition, t(146) = 2.07, 
p = .04. No other significant slope differences (all ps > .05) 
occurred between the control and treatment conditions within the 
MD group. When compared to those for average achieving chil- 
dren in the control condition, the slopes were significantly smaller 
for average achieving children in the verbal-only condition, t(146) = 
−2.30, p = .02; verbal + visual condition, t(146) = −2.14, p = 
.03; and visual-only condition, t(146) = −3.47, p < .0001. 

To follow up on the WMC by treatment by ability group 
interaction, I again set WMC to high (1.0), middle (0), and low 
(—1.0) values. At the higher WMC values, no significant treatment 
effects occurred for children with MD, F(3, 146) = 1.75, p = .15, 
or for average achieving children, F(3, 146) = 0.63, p = .59. 


Further, no significant effects occurred between groups within 
treatment conditions (all ps > .05). 

At the middle WMC level, a significant treatment effect oc- 
curred for average achieving children, F(3, 146) = 5.00, p < .003, 
but not for children with MD, F(3, 146) = 1.95, p = .12. For 
average achieving children, a Tukey test indicated a significant 
advantage (ps < .05) for the visual-only condition relative to the 
other conditions (visual-only > verbal-only = verbal + visual > 
control). No significant effects occurred between 
groups within treatment conditions (all ps > .05). 

At the low level of WMC (—1.0 z score), a significant treatment 
effect occurred for average achieving children, F(3, 146) = 5.23, 
p < .002, but not for children with MD, F(3, 146) = 0.57, p = .63. 
For average achieving children, a Tukey test yielded a significant 
adjusted posttest advantage (ps < .05) for the visual-only condi- 
tion relative to the other conditions (visual-only > verbal-only = 
verbal + visual > control). Within conditions, a significant esti- 
mated posttest advantage occurred for average achieving children 
when compared to children with MD for the visual-only condition, 
t(146) = 6.71, p = .01. In contrast, the estimated adjusted posttest 
scores were higher for children with MD than for average achieving 
children in the control condition, F(1, 146) = 8.68, p = .004. No 
other significant group effects on estimated adjusted posttest 
scores occurred at low WMC level. 

Summary. For average achieving children and WMC values 
in the low and middle range, a significant treatment advantage was 
found for the visual-only condition when compared to the other 
conditions. For children with MD, no clear treatment advantages in 
adjusted posttest operation span scores occurred when compared to 
those in the control condition. 


Effect Sizes 


The above statistical outcomes for the various treatment condi- 
tions clearly were related to the power in my analysis. Thus, to 
partially address this issue, I report effect sizes (ESs) in Table 5. I 
calculated Hedges’s $g = \gamma / \sqrt{[(n_1 - 1)SD_1^2 + (n_2 - 1)SD_2^2] / (n_1 + n_2 - 2)}$, 
where $\gamma$ is the hierarchical linear modeling coefficient 
for the intervention effect, which represents the mean difference 
between treatment conditions adjusted for both Level 1 and Level 2 
covariates; $n_1$ and $n_2$ are the sample sizes; and $SD_1$ and $SD_2$ are the 
unadjusted posttest standard deviations (What Works Clearinghouse, 
2006; see Formula 10), respectively. The Level 2 coefficients 
were adjusted for the Level 1 covariates. For the interpretation 
of the magnitude of the effect sizes (ESs), Cohen’s (1988) 
distinction was used; an ES of 0.20 is considered small, and ESs of 
0.50 and 0.80 are considered moderate and large, respectively. 
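
The effect size computation and Cohen's benchmarks can be written out as a short function. This is a sketch under stated assumptions: it follows the formula as given in the text, without any small-sample correction; the function names are hypothetical; the input values are invented for illustration and are not taken from the study; and labeling values below 0.20 as "negligible" is a convention added here rather than part of the distinction cited above.

```python
import math

def hedges_g(gamma, n1, sd1, n2, sd2):
    """Adjusted mean difference divided by the pooled unadjusted posttest SD."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return gamma / math.sqrt(pooled_var)

def magnitude(es):
    """Label an effect size using the 0.20 / 0.50 / 0.80 benchmarks."""
    size = abs(es)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "moderate"
    if size >= 0.20:
        return "small"
    return "negligible"  # label for values under 0.20 is an assumption

# Invented inputs: adjusted difference of 0.6 and two groups of 14 and 16.
g = hedges_g(gamma=0.6, n1=14, sd1=1.1, n2=16, sd2=0.9)
print(round(g, 2), magnitude(g))
```

Because the numerator is the covariate-adjusted coefficient while the denominator uses unadjusted standard deviations, the result expresses the adjusted difference in units of the raw posttest spread.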

Table 5 shows the magnitude of ESs at posttest for children with 
MD and average achievers. Reported are the effect sizes for the 
adjusted posttest scores estimated at high, middle, and low WMC 
values. Also reported are the adjusted posttest effect sizes when 
WMC was left to covary. That is, adjusted posttest outcomes are 
compared without setting WMC to a specific value. Effect sizes 
at or above .80 for comparisons of a treatment condition with the control 
condition are in bold. 

For those children with MD and when WMC was set to high 
values, high effect sizes in favor of verbal-only and visual-only 
conditions occurred when compared to the control condition 
on posttest outcome measures of problem solving. The results also 
Table 5 

Effect Sizes for Adjusted Posttest Scores for Problem Solving, 
Calculation, and Operation Span as a Function of Treatment 
and Working Memory Capacity at High, Middle, and Low Values 

Group 1 vs. 2 1 vs. 3 1 vs. 4 2 vs. 3 2 vs. 4 3 vs. 4 





Children with MD 
Problem solving 


High WMC De 0.31 2.61  —2.10 0.20 2.35 
Middle WMC 0.63 a(S O79 O52 0.11 0.64 
Low WMC —0.98 O23 E02 1.06 0.02 —1.07 
Sample* 0.89 —0.06 1.09 —0.78 0.12 0.92 
Calculation 
High WMC (28) ee ta 05a rill 1.3 2.36 
Middle WMC —0.19 —0.48 O26 ie O82 0.45 0.69 
Low WMC —0.16 (Ol) eno) 0.68 —0.40 —0.98 
Sample* —=()\) —0.64 0.39 —0.48 0.59 0.97 


Operation span 
High WMC OS4 S112 isfy = ILS 0.99 2.50 
Middle WMC 0.32 —0.55 O52 0:83 0.18 1.01 
Low WMC 0.09 Ol02 ee 05455 0.065 = 0163) 
Sample* OBS = OKs O70 0:96 0.32 1.27 


Average achievers 
Problem solving 


High WMC = Ont —0.04 0.42 0.08 0.57 0.46 


Middle WMC —0.14 —0.07  —0.48 0.09 —0.34 —0.41 
Low WMC SOs Onl kesh 0.1 ee eels 
Sample* =OW3 =O 0.318) OM. ONO =O 
Calculation 
High WMC —0.05 0.16 —0.36 O22 0.33 0:52 
Middle WMC 0.4 OW ORS SOs =e SON 
Low WMC 0.84 OD let) SOUS wa Slew 
Sample* 0.32 OMS OOO See LOn Occ 
Operation span 
High WMC OSS ONS 0225 50:2 =O29— =, 1@) 
Middle WMC 0.02 —0.43 0.48 —0.49 0.49 0.90 
Low WMC 001  —0.74 1.19 —0.78 1.26 1.89 
Sample* 0.03 —0.38 0.36 —0.44 0.35 0.73 


Note. Effect sizes at or greater than .80 when compared to the control condition 
are shown in boldface type. Conditions are denoted as follows: 1 = 
verbal-only, 2 = verbal + visual, 3 = visual-only, 4 = control. Positive 
effect sizes favor the first number; for example, the positive value for 
1 vs. 2 (effect size = 2.23) indicated an advantage for the verbal-only 
condition when compared to the verbal + visual condition. Settings for 
working memory capacity (WMC) were 1.0 for high WMC, 0 for middle WMC, 
and −1.0 for low WMC. 

* Estimates were not made conditional on setting WMC to a specific value. 


showed high effect sizes for all three treatment conditions relative 
to the control condition on estimated posttest measures of calcu- 
lation accuracy and operation span. The treatment condition that 
yielded consistently high ESs across all dependent measures when 
compared to the control condition was the visual-only condition 
(ESs ranged from 2.35 to 2.50). When WMC was set to low 
values, the effect sizes between treatment and control conditions 
were negative across all three dependent measures. This finding 
suggested that when WMC values were set to a low level, an 
advantage was found for the control when compared to the treat- 
ment conditions. Thus, no support was found for the notion that 
strategy conditions facilitated compensatory processing for chil- 
dren with low WMC. 

For average achievers with high WMC values, ESs of moderate 
magnitude occurred for strategy conditions when compared to the 
control condition (ESs ranged from .42 to .57) on the posttest 


problem solving measure. However, for average achievers with 
WMC values in the middle range, no advantages were found for a 
particular strategy condition when compared to the control condition 
on posttest measures of problem solving or calculation accuracy 
(ESs ranged from −1.20 to −1.29). An advantage for average 
achievers on the posttest operation span measure occurred for 
those with relatively lower WMC values. The visual-only condi- 
tion was particularly robust when compared to the control condi- 
tion (ES = 1.89). 


Discussion 


This study investigated the role of strategy instruction and 
working memory capacity on word problem solving accuracy and 
transfer measures in children with MD. The results showed a 
significant WMC × treatment interaction across all criterion measures. 
In general, the results indicated that working memory capacity 
played an important role in moderating the effectiveness of 
strategies on posttest performance outcomes. For children with 
MD, positive effect sizes in favor of verbal or visual conditions 
occurred when compared to the control condition on posttest 
measures of problem solving and calculation accuracy. However, 
these effects were isolated to children with relatively higher WMC 
scores. In contrast, no significant strategy treatment advantages 
occurred relative to the control condition for those children with 
MD and relatively low WMC. The treatment condition found 
particularly advantageous to children with MD who had relatively 
higher WMC across all dependent measures was the visual-only 
condition. Children with higher WMC, especially those with MD, 
were more likely to benefit from the diagramming condition than 
were those with lower WMC. The results will now be discussed in 
terms of the question that directed the study: Does WMC play an 
important role in accounting for cognitive strategy outcomes? 

Although an answer to this question is in the affirmative, the 
results must be placed in the context of the four models discussed 
in the introduction. One model argued that if reading, computation, 
and general fluid intelligence were relatively intact (in the normal 
range) for children with MD (as was the case in this study), then 
the reliable use of cognitive strategies supersedes the role that any 
individual differences in WMC might play. For the present study, 
scores for children with MD in the areas of reading, calculation, 
and fluid intelligence were in the normal range. Further, perfor- 
mance on these measures was statistically comparable among 
children with MD across treatment conditions. Thus, one would 
predict from this model minimal variation in strategy outcomes as 
a function of WMC. However, I found, in contrast to this hypoth- 
esis, that WMC interacted with strategy conditions on all criterion 
measures. Thus, the results do not support the notion that WMC 
plays a secondary role in problem-solving outcomes related to 
treatment conditions for children with MD. 

A second model suggested that a limited-capacity WM system 
underlies word problem solving difficulties in children with MD. 
This model is consistent with the notions of several theorists, who 
adopt a general resource approach in which individual differences 
on cognitive and aptitude measures draw on a limited supply of 
WM resources (e.g., Colom et al., 2008). This model assumes that 
although WMC may act in tandem with other processes, this 
general system may operate independent of strategy conditions. 
This model was not supported because a significant moderating 
effect (interaction) emerged between WMC and strategy condi-
tions, suggesting that certain strategies draw upon more working
memory resources than others. As shown in Table 4, across all 
criterion measures, children with MD who had relatively higher 
WMC benefited from strategy conditions (i.e., verbal-only, visual- 
only) relative to the control condition. Such was not the case for 
children with MD who had low WMC scores. 

A third model suggests that strategy training compensates for 
individual differences in WMC. Some studies have shown that 
strategy training helps low-span participants allocate WM re-
sources more efficiently than it does high-span participants (e.g.,
Turley-Ames & Whitfield, 2003). Thus, I expected that children 
with MD, especially those with relatively lower WM span, would 
benefit more from strategy instruction when compared to the 
control condition than would average achieving children (children 
with high spans). Such was not the case in this study. Overall, this 
study showed that children with MD and low WMC in the strategy 
conditions did not improve their problem-solving performance 
relative to those in the control condition. 

Thus, the model I prefer suggested that training in cognitive 
strategies was more likely to improve problem-solving outcomes 
for children with MD but with a relatively larger WMC. This is 
because these children have spare WM resources with which to
effectively utilize these strategies. The general patterns of the 
current study are in line with this model. The results show that 
WMC, as a continuous variable, interacted with strategy condi- 
tions in predicting solution accuracy. There are, however, at least 
four qualifications to the results. 

First, the potential moderating effects of WMC may change with 
longer intervention periods. Models of skill acquisition (e.g., Ack- 
erman, 1988) suggest that WMC may be important in the early 
phases of skill acquisition but that it becomes less important with 
longer interventions, as the implementation of strategies is autom- 
atized. Although this study cannot test this hypothesis, it may be 
the case with repeated use of strategies that the effects of WMC 
and the disadvantages of strategies in children with MD would be 
reduced. 

Second, adjusted posttest scores as a function of WMC values 
were estimates from a simple linear regression. It is unlikely that 
I had enough children with MD performing at a high WMC level 
or enough average achieving children performing at a low level of 
WMC to capture subtle differences in treatment outcomes. Thus, 
my predicted adjusted posttest means (adjusted least squares
means) may require comparisons with WMC set to less extreme
values. However, when I considered effect sizes computed on
adjusted posttest scores that were not conditional on setting WMC 
to specific levels, high ESs still emerged in favor of the verbal- 
only and visual-only conditions (1.09, .92, respectively) relative to 
the control condition for problem-solving accuracy. 

Third, the WMC effect for the verbal + visual condition is 
unclear. Although children with MD who have relatively higher 
WMC were more likely to benefit from strategy training when 
compared to those in the control condition, especially for those 
under visual-only conditions, only small effects were found for the 
verbal + visual condition regardless of variations in WMC (ESs 
varied from .02 to .20). I assumed that the verbal + visual 
condition would increase the child’s chances to draw upon sepa-
rate verbal and visual-spatial storage capacities. Thus, the combi-
nation of these storage systems, I assumed, would open up the
possibility that more information could be processed and retained
without making excessive demands on WMC (Mayer, 2005). Such 
did not appear to be the case in this study. It is possible that 
children with MD may have preferred an activation of a single 
storage system and may possibly have viewed the combination of 
verbal (attention to cues) and visual (diagramming) information as 
distracting or as interfering with more efficient processing. No 
doubt, further research on this issue is necessary. 

Finally, the mechanism that played a role toward improving 
operation span performance in both ability groups is also unclear. 
One possible explanation was that the operation span measure was 
a novel measure and the strategy conditions may have provided 
some practice in working memory. That is, participants were 
provided practice in recalling targeted information in the context 
of distracting information (identifying relevant and irrelevant 
propositions within word problems), a process attributed to work- 
ing memory (e.g., Engle et al., 1999). This explanation is consis- 
tent with studies that have attempted to directly intervene on 
working memory performance and influence achievement (e.g., 
Holmes, Gathercole, & Dunning, 2009; Klingberg et al., 2005). No 
studies that I am aware of, however, have shown that strategy 
training within an academic domain (word problem solving) di- 
rectly influences WM or vice versa (e.g., see Holmes et al., 2009, 
for discussion of the sleeper effect). Perhaps the approach I took to 
enhance transfer by embedding working memory demands (load) 
within the curriculum may be an important avenue in future 
research. It may also be the simple case, however, that because 
basic calculation was involved in the training, and because calcu- 
lation was embedded in the operation span measure, this may have 
accounted for the transfer effects. 


Implications 


My findings have several applications to current research. First,
the study may account for why some children benefit from strategy 
instructions and others do not. I found that a key variable in 
accounting for the outcomes was WMC. Clearly, WMC would not 
be the only variable across studies to account for the outcomes; 
however, the role of WMC in this study appeared to be fairly 
robust. It may be the case that when children with computation
and/or reading difficulties are included in the analysis, the effects
would be different. Thus, despite the poor treatment outcomes for
children with MD and with low WMC relative to the control 
condition, it is important to note that children in this study had 
reading and computation scores within the average range. 

Second, for children with MD and low WMC, none of the 
strategies were found particularly effective relative to the control 
condition. In fact, the visual-spatial strategy condition, which 
included diagramming, yielded substantially lower posttest scores 
than did the control condition on posttest measures of problem 
solving (ES = -1.07) and calculation (ES = -.98). This finding
aligns with several studies that have suggested that visual-spatial 
WM (represented by the visual-spatial sketchpad) is closely linked 
with MD (e.g., Bull, Espy, & Wiebe, 2008). However, a recent 
meta-analysis synthesizing research on cognitive studies of MD 
(Swanson & Jerman, 2006) suggests that memory deficits are more 
apparent in the verbal than in the visual-spatial WM domain. My
findings do suggest, however, that a moderate advantage was 
found relative to the control condition by combining verbal and
visual training for MD children with WMC values in the low range
(ES = .14) and the relatively high range (ES = .20). The sugges-
tion, perhaps, is that both routes are important routes for remedia-
tion.

An obvious question emerges as to why the visual-spatial strat- 
egy (diagramming) alone condition favored children with higher 
WMC value scores compared to those with low WMC scores. My 
best explanation is that the use of diagrams is resource demanding. 
It is also possible that not all children had adequate resources to 
enact this visual strategy without placing excessive demands on 
working memory. The visual-spatial strategy, however, may have 
provided a technique that allowed children with high WMC to 
focus on the relevant aspects of the task. Diagramming numbers 
might have activated the relevant information while preventing 
irrelevant information from interfering with problem-solving so- 
lutions. Taken together, the results suggest that visual diagram- 
ming is an effective intervention for some children with MD in 
order to increase solution accuracy. 

Third, verbal-only and visual-only strategy conditions facilitated 
calculation proficiency for children with MD and relatively higher 
WMC relative to the control condition, but they decreased perfor- 
mance for average achievers (see Table 4). Improvement in cal- 
culation was part of each lesson plan, and practice and feedback 
therefore could have played a role in the performance of children 
with MD. As shown in the standard scores reported in Table 1, 
children with MD had lower calculation scores than did average 
achieving children; therefore, strategy instruction may have pro- 
vided an additional boost in performance. The outcomes for the 
average achievers, however, are less clear. I infer that the out- 
comes may be related to classroom testing. In all classrooms, 
children were exposed to daily 1-minute calculation tests (part of
a curriculum-based assessment measure); therefore, I infer that as
average achieving children became increasingly fluent in calcula- 
tions, strategy conditions may have actually interfered (i.e., slowed 
them down) with skills that were fairly well automatized. 

A final application relates to improvement on a norm-referenced 
test. The majority of intervention studies for problem solving have 
shown gains on experimental measures and less gain on standard- 
ized measures (for reviews, see Jitendra & Xin, 1997; Powell, 
2011). Thus, in the current study, I was able to improve perfor- 
mance substantially on materials related to standardized tests. 
Although I used z scores (based on raw scores) rather than national 
norms to compare treatment conditions, it is important to note that 
in the treatment × covariate analyses, estimated adjusted problem
solving posttest scores for children with MD and relatively higher
WMC exceeded scores for those in the control condition, espe- 
cially those in the verbal and/or visual condition. 


Summary 


Taken together, these findings suggest that WMC moderates the 
influence of cognitive strategies. The results suggest that solution 
accuracy for children with MD, relative to the control condition, 
improved substantially as a function of both verbal and visual 
strategy training for those with relatively higher WMC. Addition- 
ally, weak support was found for the assumption that strategy 
conditions compensate for low WMC in children with MD. 
Learning With Retrieval-Based Concept Mapping


Janell R. Blunt and Jeffrey D. Karpicke 


Purdue University 


Students typically create concept maps while they view the material they are trying to learn. In these 
circumstances, concept mapping serves as an elaborative study activity—students are not required to 
retrieve the material they are learning. In 2 experiments, we examined the effectiveness of concept 
mapping when it is used as a retrieval practice activity. In Experiment 1, students read educational texts 
and practiced retrieval either by writing down as many ideas as they could recall in paragraph format or 
by creating a concept map (retrieval-based concept mapping). In Experiment 2, we factorially crossed the 
format of the activity (paragraph vs. concept map) and the presence or absence of the text (i.e., whether 
the activity involved repeated studying or retrieval practice). On a final test 1 week later that assessed 
verbatim knowledge and inferencing, both paragraph and concept map retrieval practice formats 
produced better performance than additional studying, but the 2 retrieval formats themselves did not 
differ. The results demonstrate the effectiveness of concept mapping when it is used as a retrieval practice 
activity and show that retrieval itself, rather than merely the act of writing, drives the benefits of
retrieval-based learning activities.


© 2014 American Psychological Association
0022-0663/14/$12.00 DOI: 10.1037/a0035934


Keywords: retrieval practice, concept mapping, learning, writing, study strategies 


Learning is often viewed as a process that occurs primarily 
when people encode or study new material, and the best learning 
is thought to occur when students elaborate on what they are
studying by forming meaningful connections and creating enriched 
knowledge structures. Retrieval, which occurs when students take 
tests, is viewed as an assessment of learning that occurred in prior 
study experiences but is not thought to create learning itself. In 
contrast to the latter assumption, a great deal of recent research has 
shown that practicing retrieval creates long-term, meaningful 
learning, sometimes even more learning than elaborative encoding 
activities (see Karpicke, 2012; Karpicke & Blunt, 2011). Our 
purpose in this article was to examine the effectiveness of two 
different retrieval practice formats: retrieving by writing informa- 
tion in paragraph format (a common way to induce retrieval 
practice; Roediger & Karpicke, 2006), and retrieving by creating 
what we refer to as retrieval-based concept maps. 

The exact mechanisms underlying the effects of retrieval prac- 
tice have not yet been specified, but the idea that retrieval practice 
effects stem from elaborative study processes has recently been
called into question. If elaboration were responsible for retrieval 
practice effects, then engaging in repeated retrieval should produce 
the same or similar effects as engaging in repeated elaborative 
studying. Recent research, however, has shown that repeated re- 
trieval consistently produces greater levels of long-term learning 
than elaborative studying. For example, Karpicke and Smith 
(2012) found that retrieval practice produced superior long-term 
retention relative to imagery-based and verbal elaborative study 
methods (see too Karpicke & Blunt, 2011). Instead of elaborative 
study processes, the benefits of retrieval practice are currently 
thought to stem from processes involved in recollecting the context
of a prior learning episode (Karpicke, Lehman, & Aue, in press). 
Remembering what occurred at a particular place and time is not 
necessary during a semantic elaboration task, but it is inherent to 
retrieval practice. As evidence for this account of retrieval-based 
learning, Karpicke and Zaromb (2010) showed that having people 
intentionally retrieve a prior event (in their experiments, the com- 
pletions to word fragments) led to greater subsequent retention 
relative to asking people to generate knowledge without thinking 
back to the past (e.g., completing fragments with the first words 
that came to mind). Therefore, an essential component of retrieval- 
based learning is what Tulving (1983) called being in an episodic 
retrieval mode, which refers to the act of thinking back to what 
occurred in a particular place and time in the past. Retrieval-based 
learning activities should be aimed at guiding students to inten- 
tionally recollect prior experiences. 

In the experiments reported here, we examined the effectiveness 
of using one popular learning task, concept mapping, as a retrieval- 
based learning activity. Concept mapping is a graphic organiza- 
tional technique in which students create node-and-link diagrams, 
where nodes represent concepts and links connecting the nodes 
represent relations among the concepts (see Figure 1; Novak & 
Gowin, 1984). Typically, students construct concept maps while
they view the materials they are learning. Although this presum-
ably allows students to enrich the material by encoding meaningful
relationships among concepts, when students create concept maps
while viewing the to-be-learned materials, they are not required to
practice retrieving the materials.

Figure 1. An example of a concept map created by a student in Experiment 1.

Recently, we carried out two experiments in which we directly 
compared the effectiveness of retrieval practice and elaborative 
studying with concept mapping (Karpicke & Blunt, 2011). Stu- 
dents studied educational texts on various science topics and either 
practiced retrieval or created concept maps of the texts. In the 
concept map conditions, students created concept maps while 
viewing the texts, whereas in the retrieval practice conditions, 
students read the texts and practiced retrieval by writing down as 
much of the material as they could recall without viewing the texts 
(a standard way of implementing retrieval practice for educational 
texts, which we refer to in this article as paragraph format; see 
Karpicke & Roediger, 2010; Roediger & Karpicke, 2006). The 
effects of these activities were assessed on final tests 1 week after 
the original learning phase. Practicing retrieval produced better 
long-term learning than elaborative concept mapping on final 
short-answer questions that assessed verbatim knowledge (items 
stated directly in the original text) and inferential knowledge 
(questions that required students to connect multiple concepts in 
the text). Furthermore, the benefits of retrieval practice were 
observed not only on short-answer questions but also on final 
assessments that involved creating a concept map of the material 
(Experiment 2 in Karpicke & Blunt, 2011). Thus, practicing re-
trieval produced more learning than creating concept maps when
the concept mapping activity was used as an elaborative study
method. 

Concept mapping could be used as a technique to implement
retrieval practice, and there are reasons to expect that concept 
mapping might serve as an effective retrieval-based learning ac- 
tivity. Specifically, concept mapping requires students to identify 
the main concepts in a text (Hay, Kinchin, & Lygo-Baker, 2008; 
Stewart, Van Kirk, & Rowell, 1979) and then identify how the 
concepts are related to each other, which helps focus students on 
the organizational structure of the material (Vanides, Yin, Tomita, 
& Ruiz-Primo, 2005). It is also assumed that creating concept 
maps helps students use their own prior knowledge to identify how
concepts might be related (Novak, 1976). Thus, concept mapping 
is thought to promote not only students’ verbatim knowledge and 
comprehension but also students’ abilities to make inferences 
about what they are learning (Novak & Gowin, 1984). 

Alternatively, there are also reasons to expect that concept 
mapping might not serve as an effective retrieval-based learning 
activity. When students freely recall material, they must adopt a 
retrieval strategy to guide their recall output (e.g., when recalling 
texts in paragraph format, students tend to recall in serial order, 
presumably to preserve the text structure; see Karpicke & Roedi- 
ger, 2010). Concept mapping might require students to adopt an 
ineffective retrieval strategy or might disrupt students’ default 
strategies, which could weaken the benefits of retrieval practice. It 
is also possible that asking students to retrieve knowledge in 
concept map format could introduce additional cognitive load 
during the process of retrieval, or the mapping task might function 
as a secondary task that divides students’ attention. Either factor
could reduce the effectiveness of retrieval practice. Finally, 
retrieval-based concept mapping might produce learning that is 
equivalent to the learning afforded by practicing retrieval in para- 
graph format. This outcome would be expected if the organiza- 
tional processing thought to occur during concept mapping were 
redundant with relational processing already afforded by para- 
graph recall and if both retrieval formats effectively allowed 
students to recollect the prior episodic context, which is the mech- 
anism considered central to retrieval-based learning (Karpicke et 
al., in press; Karpicke & Zaromb, 2010). 

In the present experiments, students read brief educational texts 
and practiced retrieval by writing in paragraph format or by 
creating concept maps. The effects of these retrieval practice 
activities were examined on a delayed short-answer test 1 week 
after the original learning phase. We also examined students’ 
subjective experiences of the different activity formats. We were 
especially interested in students’ judgments of learning (their 
predictions of how well they would perform in the future), but we 
also examined students’ ratings of how interesting, difficult, and 
enjoyable the activities were. The inclusion of these metacognitive 
judgments allowed us to examine the correspondence between 
students’ actual learning and their predicted performance, which is 
especially important to examine in light of claims that concept 
mapping represents “the most important metacognitive tool in 
science education” (Mintzes, Wandersee, & Novak, 1997, p. 424).


Experiment 1 


Experiment 1 was a conceptual replication of Karpicke and 
Blunt (2011) with one important change: Rather than having 
students create concept maps while viewing texts, we had them 
create concept maps in the absence of the texts. Thus, we directly 
compared two different retrieval practice formats: concept map- 
ping and paragraph recall. Students read and practiced retrieval of 
brief science texts. During concept map retrieval practice, students 
retrieved the material by creating a concept map, whereas during 
paragraph retrieval practice, students wrote as much of the material 
as they could recall in paragraph format. The students then made 
a series of metacognitive judgments (judgments of learning, inter- 
est, difficulty, and enjoyment). The effects of the two retrieval 
practice formats were assessed on a final test 1 week after the 
original learning phase that included both verbatim and inference 
short-answer questions. 


Method 


Subjects. Thirty-two Purdue University undergraduates par- 
ticipated in partial fulfillment of course requirements. 

Materials. Two brief texts were selected from Cook and 
Mayer (1988, as described and used by Karpicke & Blunt, 2011). 
One text, “The Human Ear,” had a sequential structure (Meyer, 
1975), which means the text described a connected series of events 
and steps in a process (the sequence of events involved in the 
process of hearing). The other text, “Make-Up of Human Blood,” 
had an enumeration structure, which means that the text listed and 
described a series of concepts (the properties of different blood 
components). The texts were 259 and 236 words in length, respec- 
tively. 


Design. The two retrieval formats (concept map vs. para- 
graph) were manipulated within subject. Each student studied two 
texts and practiced retrieval of one text in concept map format and 
the other in paragraph format. The order of the two texts and the 
order in which students performed the two learning activities were 
counterbalanced across students. 
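
The counterbalancing scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the article does not describe how students were assigned to orders, so the function names and the rotation through orders by participant index are hypothetical.

```python
from itertools import permutations

TEXTS = ("The Human Ear", "Make-Up of Human Blood")
FORMATS = ("concept map", "paragraph")


def counterbalanced_orders():
    """Cross every text order with every format order (2 x 2 = 4 orders)."""
    orders = []
    for text_order in permutations(TEXTS):
        for format_order in permutations(FORMATS):
            # Pair the first text with the first format, the second with the second.
            orders.append(list(zip(text_order, format_order)))
    return orders


def assign(n_subjects):
    """Rotate subjects through the orders (assumed scheme, not the authors').

    Each order is used equally often when n_subjects is a multiple of 4.
    """
    orders = counterbalanced_orders()
    return [orders[i % len(orders)] for i in range(n_subjects)]


schedule = assign(32)  # 32 subjects, as in Experiment 1
```

With 32 students and four orders, this rotation places eight students in each order, and every student sees both texts and both retrieval formats.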

Procedure. Students were tested in small groups in two ses- 
sions. During the learning phase (Session 1), students read one text 
for 5 min, recalled it for 10 min, reread it for 5 min, and recalled 
it again for 10 min in one of the two retrieval practice conditions. 
They then repeated this procedure for the other text and other 
retrieval practice condition. 

Before completing the concept mapping retrieval practice con- 
dition, the students were instructed about the nature of the concept 
mapping activity. They were told that a concept map is a diagram 
in which concepts are represented as nodes that are linked together 
with words and phrases. The students were shown an example of 
a concept map selected from Novak (2005). Then, during recall 
periods, they were given a sheet of paper and told to recall the text 
by creating a concept map. Students were allowed to refer to the 
example concept map, but not to the text, throughout each 10-min 
recall period. Pilot testing showed that this was enough time for 
students to reach asymptotic levels of recall under these condi- 
tions. 

In the paragraph retrieval practice condition, students saw a 
response box on a computer screen and were told to recall as much 
of the information from the text as they could by typing their 
responses on the computer during each 10-min recall period (see 
Karpicke & Roediger, 2010). Overall, the total amount of learning 
time was identical in the elaborative concept mapping and retrieval 
practice conditions. 

At the end of each learning activity, the students were asked to 
predict how much of the material they would remember in 1 week 
(an aggregate judgment of learning) and to rate the enjoyment, 
difficulty, and interestingness of the activities. Students made their 
ratings on a scale from 0% to 100% in increments of 10 (0, 10, 20, 
..., 80, 90, 100). At the end of Session 1, after completing both
activities, students indicated which retrieval practice format they 
preferred. 

The students were dismissed and returned to the laboratory 1 
week later for the final short-answer test, which included 10 
verbatim questions and four inference questions per text. Examples 
of questions are shown in the Appendix. During the final test, each 
question remained on the screen for at least 20 s; at that time, a 
button labeled “Next” appeared on the screen, and students pressed 
the button to proceed to the next question. Students were encour- 
aged to take as much time as needed to answer the questions. At 
the end of the second session, the students were debriefed and 
thanked for their participation. 


Results 


An initial analysis indicated that there were no differences 
among the counterbalancing orders, so the results have been col- 
lapsed across orders. There was a difference between texts such 
that initial and final performance was better on the “Make-Up of 
Human Blood” text than on the “Human Ear” text. However, text 
did not interact with any other factors in the experiment, so the 
results have been collapsed across texts. 
Scoring. The texts were divided into 30 idea units for scoring
purposes. Both the paragraph and concept map protocols were 
scored using the same criteria: Students were given 1 point for 
each idea unit recalled (Karpicke & Blunt, 2011; Karpicke & 
Roediger, 2010). On the final short-answer test, correct responses 
were given 1 point, and partially correct responses were given
partial credit (e.g., .75, .50, or .25 points, depending on complete- 
ness of the response). Two independent raters scored all recall 
protocols and short-answer tests, and a third rater resolved all 
discrepancies to reach 100% agreement. 
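
The scoring rules above can be expressed as a short sketch. Several details are assumptions made for illustration and are not stated in the article: that each text has 30 idea units, that short-answer performance is the mean credit across questions, and that a disagreement between the two raters is settled by adopting the third rater's score. All function names are hypothetical.

```python
IDEA_UNITS_PER_TEXT = 30  # assumed to apply to each text
ALLOWED_CREDIT = {0.0, 0.25, 0.5, 0.75, 1.0}


def proportion_recalled(recalled_units):
    """Proportion of idea units credited in one recall protocol.

    recalled_units: indices (0..29) of idea units judged present,
    whether the protocol is a paragraph or a concept map.
    """
    units = set(recalled_units)  # each idea unit earns at most 1 point
    if not all(0 <= u < IDEA_UNITS_PER_TEXT for u in units):
        raise ValueError("idea-unit index out of range")
    return len(units) / IDEA_UNITS_PER_TEXT


def short_answer_score(credits):
    """Mean credit across short-answer questions (full or partial credit)."""
    if not credits or not all(c in ALLOWED_CREDIT for c in credits):
        raise ValueError("credits must be non-empty and use allowed values")
    return sum(credits) / len(credits)


def resolve(rater_a, rater_b, rater_c):
    """Keep scores the two raters agree on; otherwise use the third rater's
    score (an assumed resolution rule)."""
    return [a if a == b else c for a, b, c in zip(rater_a, rater_b, rater_c)]
```

Applying the same `proportion_recalled` function to both protocol types mirrors the point made in the text that paragraphs and concept maps were scored with identical criteria.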

Learning performance. The left side of Table 1 shows per-
formance during the learning phase in Experiment 1 (the propor-
tion of idea units recalled in each condition). Collapsed across
retrieval formats, the proportion of ideas recalled increased from
Period 1 to Period 2 (.39 vs. .55), t(31) = 10.34, d = 1.83, 95%
confidence interval (CI) [1.25, 2.39].¹ Students recalled more ideas
in paragraph format than in concept map format. This pattern
occurred in Period 1, t(31) = 3.77, d = 0.66, 95% CI [0.28, 1.04],
and in Period 2, t(31) = 5.52, d = 0.98, 95% CI [0.55, 1.39]. We
examined the differences in initial recall in the concept map and 
paragraph conditions in a post hoc analysis reported in a later 
section. 
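
As a check on the arithmetic, the effect sizes reported for these within-subject comparisons are consistent with dividing the t statistic by the square root of the number of subjects. The article does not state which formula was used in this section, so the sketch below is an assumption that happens to reproduce two of the reported values; the function name is illustrative.

```python
import math


def cohens_d_paired(t_value, n_subjects):
    """Effect size for a within-subject contrast, computed as t / sqrt(n)."""
    return t_value / math.sqrt(n_subjects)


n = 32  # the t tests above have 31 degrees of freedom

# Period 1 vs. Period 2, collapsed across formats: t = 10.34
print(f"{cohens_d_paired(10.34, n):.2f}")  # prints 1.83

# Paragraph vs. concept map in Period 2: t = 5.52
print(f"{cohens_d_paired(5.52, n):.2f}")  # prints 0.98
```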

Final short-answer performance. Figure 2 shows perfor- 
mance on the final short-answer test that occurred 1 week after the 
initial learning phase. Performance was essentially equivalent in 
the concept map and paragraph retrieval practice format condi- 
tions. There were only small differences, slightly favoring the 
paragraph format over the concept map format, on the verbatim 
questions (.68 vs. .62), t(31) = 1.07, d = 0.19, 95% CI [-0.16,
0.54], and on the inference questions (.84 vs. .82), t(31) = 0.41,
d = 0.07, 95% CI [-0.28, 0.42].

Subjective ratings. The right panel of Figure 2 shows stu- 
dents’ judgments of learning, and Table 2 shows students’ addi- 
tional ratings of their experiences during the learning tasks. There 
were very small differences in students’ judgments of learning, 
t(31) = 0.26, d = 0.05, 95% CI [-0.30, 0.39]; ratings of enjoy-
ment, t(31) = 0.31, d = 0.05, 95% CI [-0.29, 0.40]; ratings of
task difficulty, t(31) = 0.33, d = 0.06, 95% CI [-0.29, 0.40]; and
ratings of the interestingness of the tasks, t(31) = 1.04, d = 0.18,
95% CI [-0.17, 0.53]. However, at the end of the initial learning
phase, when students were asked to indicate which format they 
preferred, the majority of students preferred the paragraph format 
(20/32 students = 63%) to the concept map format (12/32 stu- 
dents = 37%). 


Table 1
Proportion of Idea Units Produced in Each Learning Period in
Experiments 1 and 2


                                    Experiment 1               Experiment 2
Learning activity               Period 1     Period 2      Period 1     Period 2
Retrieval practice (no text)
   Concept map                  [illegible]  [illegible]   [illegible]  .39 (.03)
   Paragraph                    .44 (.03)    .64 (.03)     .27 (.03)    .48 (.04)
Repeated study (text)
   Concept map                  —            —             .48 (.02)    .58 (.03)
   Paragraph                    —            —             .53 (.04)    .62 (.03)


Note. Standard errors of the means are shown in parentheses. 


Conditional analysis. In the next two sections, we report two 
sets of analyses aimed at exploring differences in recall in the 
concept map and paragraph conditions. The left portion of Table 3 
shows the results of an analysis of the relationship between initial 
learning performance and final short-answer performance (col- 
lapsed across question type) in Experiment 1. In order to analyze 
the fate of idea units on the final test, we coded short-answer 
questions based on the idea unit or units required to answer the 
questions. Verbatim questions typically required access to a single 
idea unit (collapsed across texts, M = 1.3 idea units per verbatim 
question). For example, the question “What happens when hemo- 
globin combines with oxygen?” corresponded to the idea unit 
“Hemoglobin releases oxygen to the lungs.” Inference questions 
required access to multiple idea units (collapsed across texts, M = 
2.3 idea units per inference question). For example, the question 
“What would happen if blood did not contain white blood cells, 
and bacteria were introduced to the body?” relies on the following 
idea units: (a) “White blood cells are mainly disease fighters”; (b) 
“White blood cells digest bacteria and other foreign material”; and 
(c) “When there is an infection somewhere in the body, white 
blood cells move toward it.” 

We followed Tulving’s (1964) method to analyze the correspon- 
dence in recall of individual idea units across two tests (see also 
Karpicke & Zaromb, 2010). C₁ refers to idea units produced in 
either Period 1 or 2 in the initial learning phase, and N₁ refers to 
idea units that were not produced in the initial learning phase. C₂ 
refers to short-answer questions correctly answered on the final 
short-answer test, and N₂ refers to questions not correctly an- 
swered on the final test. As shown in Table 3, the joint probability 
of recalling an idea initially and correctly answering a final short- 
answer question (C₁C₂) was greater in the paragraph condition 
than in the concept map condition, t(31) = 3.68, d = 0.65, 95% CI 
[0.26, 1.03]. Likewise, the probability of not recalling an idea but 
then correctly answering a final question (N₁C₂) was greater in the 
concept map condition than in the paragraph condition, t(31) = 
2.45, d = 0.43, 95% CI [0.07, 0.79]. Together, these results reflect 
the fact that students initially recalled more ideas in paragraph 
format than they did in concept map format, yet the conditions 
produced equivalent levels of final short-answer performance. (We 
explore this pattern further in the analysis reported in the next 
section.) There was a small difference in intertest forgetting (the 
probability of recalling an idea but then failing to answer a short- 
answer question; C₁N₂) across conditions, with the paragraph 
condition showing slightly less forgetting, t(31) = 1.67, d = 0.30, 
95% CI [-0.06, 0.65]. Finally, the proportion of ideas not recalled or 
expressed on either test (N₁N₂) was greater in the concept map 
condition relative to the paragraph condition, t(31) = 2.32, d = 
0.41, 95% CI [0.05, 0.77]. 
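
The joint-probability classification described above can be illustrated with a minimal sketch. It is not the authors' scoring code: the function name, the cell labels, and the sample data are hypothetical, and it assumes each final-test question maps onto exactly one idea unit, whereas the inference questions in the experiments drew on several idea units.

```python
from collections import Counter

CELLS = ("C1C2", "C1N2", "N1C2", "N1N2")


def joint_probabilities(initial, final):
    """Proportion of idea units falling in each initial-by-final cell.

    initial[i] is True if idea unit i was produced during initial learning;
    final[i] is True if the final-test question tied to unit i was answered
    correctly (a one-to-one mapping is assumed for illustration only).
    """
    if len(initial) != len(final) or len(initial) == 0:
        raise ValueError("need two equal-length, non-empty sequences")
    counts = Counter(
        ("C1" if produced else "N1") + ("C2" if answered else "N2")
        for produced, answered in zip(initial, final)
    )
    n = len(initial)
    return {cell: counts[cell] / n for cell in CELLS}


# Hypothetical outcomes for ten idea units from one student.
initial = [True, True, False, False, True, False, True, False, False, True]
final = [True, False, True, False, True, True, True, False, True, True]
probs = joint_probabilities(initial, final)
```

Because every idea unit falls into exactly one of the four cells, the proportions for a student sum to 1, which is a useful check when reading the rows of Table 3.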

Initial recall and normative importance. Students might 
have produced fewer ideas during initial concept map recall rela- 
tive to initial paragraph recall because they adopted different 
output strategies in the two tasks. We reasoned that students might 
selectively produce only “important” ideas under concept map 


We report standardized mean differences (ds) and 95% confidence 
intervals around the effect size estimates (see Cumming, 2012), which 
were calculated using the Methods for the Behavioral, Educational, and 
Social Sciences (MBESS) package for R (Kelley, 2007). 

Final short-answer performance for verbatim questions and inference questions (left and middle 
panels), and judgments of learning (right panel) in Experiment 1. Error bars represent standard errors of the 
means. 


conditions. To examine this possibility, we had 16 undergraduate 
students, who were not subjects in either experiment reported here, 
rate the importance of all 30 idea units from both texts, using a 
scale ranging from 1 (not important at all) to 5 (very important). 
The average importance rating was calculated for each idea unit, 
and the intraclass correlation among the average ratings was .78, 
indicating good interrater reliability (Shrout & Fleiss, 1979). If 
students selectively included important ideas in the concept map 
condition, then the average importance rating of recalled ideas 
should be greater in the concept map condition than in the para- 
graph recall condition. The results of our analysis confirmed this: 
Students tended to output ideas with higher normative importance 
ratings in the concept map condition (M = 3.86, SE = 0.02) than 
in the paragraph condition (M = 3.76, SE = 0.02), t(31) = 3.20, 
d = 0.57, 95% CI [0.19, 0.94]. Although the raw mean difference 
was small, the result was robust: for 28 of 32 students (88%), the 
mean normative importance of recalled ideas was greater in the 
concept map condition than in the paragraph condition. This anal- 
ysis indicates that students might have covertly retrieved the same 
number of ideas in both retrieval practice conditions (which would 
still benefit learning; see Smith, Roediger, & Karpicke, 2013), but 
students chose to include the relatively more important ideas when 
creating their concept maps. 


Discussion 


Experiment 1 showed that practicing retrieval in paragraph 
format or in concept map format produced approximately equiv- 
alent levels of performance on a delayed assessment of learning. 
Students also gave nearly identical subjective ratings to the two 
retrieval practice formats (judgments of learning and ratings of 
enjoyment, difficulty, and interestingness of the tasks), though 
students did tend to prefer the paragraph retrieval format relative to 
the concept map format. These results provide preliminary evi- 
dence that concept mapping may be an effective retrieval practice 
activity. Experiment 2 was carried out as a further investigation of 
the paragraph and concept map formats when used as either 
retrieval practice or repeated study activities. 


Experiment 2 


Experiment 2 was designed with two main purposes in mind. 
First, we sought to replicate Experiment 1 and generalize the 
results to a new set of text materials. Second, we included two new 
conditions to directly compare concept mapping and paragraph 
formats when they are used as retrieval practice activities (without 
the texts) with when they are used as repeated study activities 
(with the texts present; see Agarwal, Karpicke, Kang, Roediger, & 
McDermott, 2008; Agarwal & Roediger, 2011). Thus, in Experi- 
ment 2, we factorially crossed the presence of the material during 
the learning activity (text vs. no text) with the format of the 
learning activity (concept map vs. paragraph format). Our predic- 
tion was that the retrieval-based learning conditions would en- 
hance long-term retention more than the repeated study conditions, 
even though students in the two conditions completed the exact 
same activities either with or without the materials in front of 
them. This result would support the idea that practicing retrieval, 
rather than the mere act of writing down the material in paragraph 
or concept map format, is the key to promoting long-term learning. 


Table 2 
Students’ Ratings of Enjoyment, Difficulty, and Interestingness of the Learning Activities in 
Experiments 1 and 2 

                                         Experiment 1                          Experiment 2
Learning activity              Enjoyment  Difficulty  Interest     Enjoyment  Difficulty  Interest
Retrieval practice (no text)
  Concept map                  .49 (.04)  .46 (.04)   .49 (.04)    .39 (.06)  .55 (.05)   .42 (.06)
  Paragraph                    .51 (.04)  .47 (.04)   .53 (.04)    .40 (.04)  .54 (.06)   .49 (.04)
Repeated study (text)
  Concept map                  -          -           -            .50 (.06)  .40 (.04)   .55 (.05)
  Paragraph                    -          -           -            .29 (.06)  .30 (.04)   .32 (.06)

Note. Students’ ratings were indicated on a scale from 0 (not at all) to 100 (totally). Ratings were then 
converted to proportions. Standard errors of the means are shown in parentheses. 


Table 3 
Joint Probabilities Between Initial Performance and Final Short-Answer Performance in Experiments 1 and 2 

                                                  Experiment 1                                          Experiment 2
Learning activity              C₁C₂         C₁N₂         N₁C₂         N₁N₂         C₁C₂         C₁N₂         N₁C₂         N₁N₂
Retrieval practice (no text)
  Concept map                  .34 (.05)    .08 (.02)    .38 (.04)    .20 (.05)    .20 (.04)    .13 (.02)    .38 (.04)    [illegible]
  Paragraph                    [illegible]  [illegible]  .29 (.02)    .09 (.03)    .28 (.04)    [illegible]  .31 (.02)    .26 (.04)
Repeated study (text)
  Concept map                  -            -            -            -            .28 (.04)    .26 (.03)    .24 (.02)    .21 (.03)
  Paragraph                    -            -            -            -            .25 (.04)    .31 (.02)    .18 (.02)    .26 (.04)

Note. Standard errors of the means are shown in parentheses. C₁ = items produced during the initial learning activity; N₁ = items that were not produced 
during the initial learning activity; C₂ = questions correctly answered on the final short answer test; N₂ = questions not correctly answered on the final 
short answer test. 


Method 


Subjects. Eighty Purdue University undergraduates partici- 
pated in partial fulfillment of course requirements. None of the 
students had participated in Experiment 1. 

Materials. Two science texts were based on information in 
Stabler, Metz, and Gier (2011). One text, “Enzymes,” had a 
generalization structure (Meyer, 1975), which means the sentences 
in the passage provided clarification or examples of one main idea. 
The other text, “Domains of Life,” had an enumeration structure 
(like the “Make-Up of Human Blood” text used in Experiment 1). 
The texts were 283 and 282 words in length, respectively. 

Design. A 2 (activity format: concept map vs. paragraph) × 2 
(learning condition: repeated study vs. retrieval practice) between- 
subjects design was used. There were four conditions, and 20 
students were assigned to each condition. Each student completed 
the same activity for two texts, and the order in which the texts 
were presented was counterbalanced across students. 
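
As an illustration of the design just described, the sketch below allocates 80 subjects evenly to the four cells of the 2 × 2 layout. It is a hypothetical allocation routine, not the authors' procedure: the text states only that each student was assigned to one condition, so the random shuffle, the function name, and the seed are assumptions.

```python
import random
from collections import Counter
from itertools import product

FORMATS = ("concept map", "paragraph")
LEARNING = ("repeated study", "retrieval practice")


def assign_conditions(subject_ids, per_cell=20, seed=None):
    """Shuffle subjects and split them evenly across the four cells."""
    cells = list(product(LEARNING, FORMATS))
    subjects = list(subject_ids)
    if len(subjects) != per_cell * len(cells):
        raise ValueError("number of subjects must equal per_cell x 4")
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    rng.shuffle(subjects)
    # Consecutive blocks of `per_cell` shuffled subjects share one cell.
    return {sid: cells[i // per_cell] for i, sid in enumerate(subjects)}


assignment = assign_conditions(range(1, 81), seed=0)
cell_sizes = Counter(assignment.values())
```

With equal cell sizes, comparisons collapsed across activity format involve 40 students per group, and comparisons within one learning condition involve 20 per group, which matches the degrees of freedom (78 and 38) reported in the Results.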

Procedure. The procedure was similar to the one used in 
Experiment 1. Students were tested in small groups in two ses- 
sions, and each student was assigned to one of four learning 
conditions: (a) repeated study—concept map, (b) repeated study— 
paragraph, (c) retrieval practice-concept map, and (d) retrieval 
practice—paragraph. During the learning phase, students read one 
text for 5 min, engaged in a learning activity for 10 min, reread the 
text for 5 min, and completed the learning activity again for 10 
min. Students then repeated the procedure for the second text. All 


instructions were identical in the repeated study and retrieval 
practice conditions, and the total amount of learning time was 
equivalent in all conditions. The only difference was that in the 
repeated study conditions, students viewed the texts while they 
completed the learning activities, whereas in the retrieval practice 
conditions the students completed the activities without the texts 
(as in Experiment 1). Thus, students in the repeated study—concept 
map condition completed their concept maps while reading the 
texts (Karpicke & Blunt, 2011), and students in the repeated 
study—paragraph condition were instructed to write everything 
from the text on their paper in paragraph format (essentially 
copying the text). In both conditions, students were told to include 
all of the ideas from the texts. Texts were presented on the 
computer screen, and students completed the concept mapping or 
paragraph activities on paper. The subjective rating procedures and 
the final short answer test procedures were identical to those used 
in Experiment 1. 


Results 


An initial analysis indicated that there were no differences 
among the counterbalancing orders, and the levels of performance 
and patterns of results were the same for the two texts. Thus, the 
results have been collapsed across counterbalancing orders and 
texts. 

Scoring. The texts were divided into 40 idea units for scoring 
purposes, and the scoring procedure used in Experiment 1 was 
used in Experiment 2. Two independent raters scored all recall 
protocols and short-answer tests, and a third rater resolved all 
discrepancies to achieve 100% agreement. 

Learning performance. The right portion of Table 1 shows 
the mean proportion of idea units produced in each period in the 
initial learning phase in Experiment 2. Collapsed across condi- 
tions, the proportion of ideas produced increased from Period 1 to 
Period 2 (.38 vs. .52), t(79) = 11.59, d = 1.33, 95% CI [1.02, 
1.62]. Students in the repeated study conditions (who viewed the 
texts during the concept map and paragraph activities) produced 
more ideas than did students in the retrieval practice conditions. 
This was true for both activity formats in Period 1 (.50 vs. .25), 
t(78) = 8.25, d = 1.85, 95% CI [1.32, 2.36], and Period 2 (.60 vs. 
.44), t(78) = 5.06, d = 1.13, 95% CI [0.65, 1.60]. In the repeated 
study conditions, there were very small differences in the proportion 
of ideas produced in the concept map and paragraph formats 
in Period 1 (M = 0.48 vs. 0.53), t(38) = 1.07, d = 0.34, 95% CI 
[-0.29, 0.96], or in Period 2 (M = 0.58 vs. 0.62), t(38) = 0.70, d = 
0.31, 95% CI [-0.31, 0.93]. However, as in Experiment 1, students 
in the retrieval practice conditions tended to recall more ideas in 
the paragraph condition than in the concept map condition. There 
was a small difference in Period 1 (.27 vs. .24), t(38) = 0.56, 
d = 0.18, 95% CI [-0.44, 0.80], and a larger difference in Period 2 
(.48 vs. .39), t(38) = 1.96, d = 0.62, 95% CI [-0.02, 1.25]. In a 
later section, we report an analysis of 
the role of idea unit importance in students’ performance. 
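
For readers who want to relate the reported t statistics to the effect sizes, the sketch below applies the standard conversion for two independent groups, d = t × sqrt(1/n1 + 1/n2). This is an approximate consistency check, not the authors' computation: the effect sizes and intervals in the article were obtained with the MBESS package for R, the function name here is hypothetical, and the group sizes (40 per group when collapsing across formats, 20 per group within a learning condition) are taken from the Design section.

```python
import math


def d_from_independent_t(t, n1, n2):
    """Standardized mean difference implied by an independent-groups t."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("group sizes must be positive")
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


# Repeated study vs. retrieval practice, collapsed across formats (40 vs. 40).
d_period1 = d_from_independent_t(8.25, 40, 40)  # article reports d = 1.85
d_period2 = d_from_independent_t(5.06, 40, 40)  # article reports d = 1.13

# Paragraph vs. concept map within retrieval practice (20 vs. 20).
d_recall_p1 = d_from_independent_t(0.56, 20, 20)  # article reports d = 0.18
d_recall_p2 = d_from_independent_t(1.96, 20, 20)  # article reports d = 0.62
```

Small discrepancies in the second decimal place are expected, because the t values in the text are themselves rounded.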

Final short-answer performance. Figure 3 shows perfor- 
mance on the final short-answer test 1 week after the initial 
learning phase. In general, students in the retrieval practice con- 
ditions (without the text available) performed better than students 
in the repeated study conditions (with the text available), but 
whether the activity was in concept map or paragraph format made 
little difference for long-term retention. 

For verbatim questions, collapsed across activity formats, stu- 
dents in the retrieval practice (no text) conditions outperformed 
students in the repeated study (with text) conditions (.48 vs. .38), 
t(78) = 2.22, d = 0.50, 95% CI [0.05, 0.94]. In the retrieval 
practice condition, there was a small difference between the para- 
graph and concept map formats, favoring the paragraph format, as 
was the case in Experiment 1 (.49 vs. .46), t(38) = 0.50, d = 0.16, 
95% CI [-0.46, 0.78]. However, in the repeated study condition, 
there was a larger difference between activity formats, favoring the 
concept map format over the paragraph format (.43 vs. .33), 
t(38) = 1.59, d = 0.50, 95% CI [-0.13, 1.13]. This result supports 
the idea that creating a concept map while studying a text afforded 
elaborative encoding, as concept mapping enhanced long-term 
retention relative to essentially copying the text in the repeated 
study—paragraph condition. 

The pattern of results was similar for the inference questions. 
Collapsed across activity formats, students in the retrieval practice 
conditions outperformed students in the repeated study conditions 
(.39 vs. .31), t(78) = 2.07, d = 0.46, 95% CI [0.02, 0.91]. In the 
retrieval practice condition, there was almost no difference be- 
tween the paragraph and concept map formats (.40 vs. .39), 
t(38) = 0.27, d = 0.09, 95% CI [-0.53, 0.70]. Likewise, there was 
almost no difference between activity formats in the repeated study 
condition (.30 vs. .32), t(38) = 0.30, d = 0.09, 95% CI [-0.52, 
0.71], a result that is somewhat surprising in light of the advantage 
of concept mapping seen in the verbatim questions, as reported 
earlier. 

Subjective ratings. The right panel of Figure 3 shows stu- 
dents’ judgments of learning, which were made at the end of each 
task in the learning phase. Collapsed across activity formats, 
judgments of learning were higher in the repeated study conditions 
relative to the retrieval practice conditions (.56 vs. .48), t(78) = 
1.32, d = 0.30, 95% CI [-0.15, 0.74]. Although the effect was 
small in the present experiment, the finding that students believed 
they had learned more after repeatedly studying than after prac- 
ticing retrieval is consistent with a wealth of prior work (e.g., 
Agarwal et al., 2008; Karpicke & Blunt, 2011; see Karpicke, 2012, 
for review). In the repeated study condition, students’ judgments 
of learning were higher in the concept map condition than in the 
paragraph condition (.61 vs. .50), t(38) = 1.46, d = 0.46, 95% CI 
[-0.17, 1.09]. In the retrieval practice condition, the opposite 
pattern occurred: students’ judgments of learning were higher in 
the paragraph condition than in the concept map condition (.53 vs. 
.43), t(38) = 1.27, d = 0.51, 95% CI [-0.13, 1.13]. 

Table 2 shows students’ additional ratings of their subjective 
experiences in the learning tasks, and here we highlight a few 
findings displayed in the table. Students rated the repeated study— 
concept map condition as most enjoyable and the repeated study— 
paragraph task as least enjoyable (.50 vs. .29), t(38) = 2.47, d = 
0.78, 95% CI [0.13, 1.42], which is likely due to boredom asso- 
ciated with simply copying the text in the latter condition. The 
enjoyment ratings of the two retrieval practice conditions fell in 
between the ratings of the two repeated study conditions. A similar 
pattern was observed in the interest ratings: students rated the 
repeated study—concept map task as most interesting and the 
repeated study—paragraph task as least interesting (.55 vs. .32), 
t(38) = 2.99, d = 0.94, 95% CI [0.28, 1.59], and the interest 
ratings of the two retrieval practice conditions fell in between the 
ratings of the two repeated study conditions. Finally, collapsed 
across activity formats, the retrieval practice tasks were rated as 
more difficult than the repeated study tasks (.54 vs. .35), t(78) = 
3.93, d = 0.89, 95% CI [0.42, 1.34]. 

Conditional analysis. The right portion of Table 3 shows the 
results of an analysis of the relationship between initial learning 
performance and final short-answer performance in Experiment 2. 
As in Experiment 1, short-answer questions were coded based on 
the idea unit or units required to answer the questions. Verbatim 
questions typically required access to a single idea unit (M = 1.5 
idea units per verbatim question). For example, the question “What 
do proteins lose at high temperatures?” corresponded to the idea 
unit “Proteins lose their structure at high temperatures.” Inference 
questions required access to multiple idea units (M = 2.9 idea units 
per inference question). For example, the question “What happens 
to catalytic activity if temperature decreases?” relies on the following 
idea units: (a) “Catalytic activity is greatly affected by 
temperature”; (b) “Increasing temperature will also increase the 
amount of free energy”; (c) “This results in an increased rate of 
collision”; and (d) “[This] leads to a faster reaction time.” 

Figure 3. Final short-answer performance for verbatim questions and inference questions (left and middle 
panels), and judgments of learning (right panel) in Experiment 2. Error bars represent standard errors of the 
means. 

First, we analyzed the relationship between initial learning per- 
formance and final short-answer performance, collapsing across 
activity format (concept map vs. paragraph). As shown in Table 3, 
the probability of recalling an idea but then failing to answer a 
short-answer question (intertest forgetting; C₁N₂) was greater in 
restudy conditions than in the retrieval practice conditions, t(78) = 
7.25, d = 1.62, 95% CI [1.11, 2.12]. Likewise, the probability of 
not recalling an idea but then correctly answering a final question 
(N₁C₂) was greater in retrieval practice conditions than in restudy 
conditions, t(78) = 4.99, d = 1.12, 95% CI [0.64, 1.58]. There 
were small differences across conditions in C₁C₂, t(78) = 0.60, 
d = 0.13, 95% CI [-0.31, 0.57], and N₁N₂, t(78) = 0.88, d = 0.20, 
95% CI [-0.24, 0.64]. 

The pattern of results within the retrieval practice conditions 
(comparing the concept map format with the paragraph format) 
replicated the results of Experiment 1. The joint probability of 
recalling an idea initially and correctly answering a final short- 
answer question (C₁C₂) was slightly greater in the paragraph 
condition than in the concept map condition, t(38) = 1.46, d = 
0.46, 95% CI [-0.17, 1.09]. Likewise, the probability of not 
recalling an idea but then correctly answering a final question 
(N₁C₂) was slightly greater in the concept map condition than in 
the paragraph condition, t(38) = 1.63, d = 0.52, 95% CI [-0.12, 
1.14]. There was a small difference in intertest forgetting (the 
probability of recalling an idea but then failing to answer a short- 
answer question; C₁N₂) across conditions, with those in the para- 
graph condition showing slightly less forgetting, t(38) = 0.91, d = 
0.29, 95% CI [-0.34, 0.91]. Finally, the proportion of ideas not 
recalled or expressed on either test (N₁N₂) was slightly greater in 
the concept map condition relative to the paragraph condition, 
t(38) = 0.93, d = 0.29, 95% CI [-0.33, 0.92]. 

Initial recall and normative importance. As in Experiment 
1, 16 independent raters, who had not served as raters or subjects 
in Experiments 1 or 2, rated the importance of each idea unit in the 
two texts used in Experiment 2, using a scale from 1 (not important 
at all) to 5 (very important). The average importance rating was 
calculated for each idea unit, and the intraclass correlation among 
the average ratings was .80. In the retrieval practice condition, the 
mean importance rating of the idea units that students recalled was 
greater in the concept map condition (M = 3.60, SE = 0.02) than 
in the paragraph condition (M = 3.48, SE = 0.03), t(38) = 3.00, 
d = 0.95, 95% CI [0.29, 1.60]. However, in the repeated study 
condition, there was a smaller difference between the mean im- 
portance ratings in the concept map (M = 3.54, SE = 0.02) and 
paragraph conditions (M = 3.50, SE = 0.02), t(38) = 1.36, d = 
0.43, 95% CI [-0.20, 1.05]. Thus, as in Experiment 1, when 
students practiced retrieval, they tended to include ideas with 
higher normative importance ratings in the concept map conditions 
than in the paragraph conditions, though this difference was much 
smaller when students completed the activities with the materials 
in front of them. 
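
The importance analysis can be sketched as follows: each idea unit carries a normative rating (the average across independent raters), and a student's score is the mean rating of the idea units that student produced. The ratings, unit numbers, and function name below are hypothetical and serve only to illustrate the computation.

```python
from statistics import mean


def mean_importance_of_recalled(importance, recalled_units):
    """Mean normative importance of the idea units a student produced."""
    if not recalled_units:
        raise ValueError("student produced no idea units")
    return mean(importance[unit] for unit in recalled_units)


# Hypothetical normative ratings (1-5 scale) for five idea units.
importance = {1: 4.5, 2: 3.0, 3: 2.5, 4: 4.0, 5: 3.5}

# A selective output (two high-importance units) vs. a fuller output.
concept_map_score = mean_importance_of_recalled(importance, [1, 4])
paragraph_score = mean_importance_of_recalled(importance, [1, 2, 3, 4])
```

In this toy case the selective protocol contains fewer ideas but has the higher mean importance, which is the pattern the analysis was designed to detect.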


Discussion 


Experiment 2 showed that actively retrieving material during 
learning, either by creating concept maps or by writing the material 
in paragraph format, enhanced long-term retention more than 
completing the same activities in the presence of the materials (as 
study activities). Practicing retrieval produced more learning than 
repeated studying even though students re-experienced the entire 
set of material in the repeated study conditions, whereas students 
only re-experienced what they could recall in the retrieval practice 
conditions. Indeed, the proportion of ideas recalled in the retrieval 
practice conditions was lower than the proportion of ideas pro- 
duced on the concept map or paragraph protocols in the repeated 
study conditions. It is important to note that the concept map and 
paragraph formats were equally effective as retrieval practice 
activities. As in Experiment 1, there were no additional benefits 
conferred by retrieval-based concept mapping beyond practicing 
retrieval in paragraph format. There was a small cost to retrieval- 
based concept mapping in the initial recall periods, on which 
students recalled fewer ideas in the concept map condition than in 
the paragraph condition. However, this cost was not seen on the 
final delayed assessments of long-term retention. Together with 
Experiment 1, the results of Experiment 2 show that concept 
mapping can serve as an effective learning task when it is imple- 
mented as a retrieval-based learning activity. 


General Discussion 


The purpose of the present experiments was to examine the 
effectiveness of retrieval-based concept mapping. The results show 
that the critical factor in retrieval-based learning is requiring 
students to think back to and recall material, while the format in 
which information is retrieved (concept map or paragraph format) 
did not much matter. We review three important findings from the 
present experiments in light of hypotheses proposed in the intro- 
duction. 

First, concept mapping and paragraph formats were equally 
effective retrieval-based learning activities. When students created 
retrieval-based concept maps of the materials, there were no prac- 
tical differences, relative to recalling in paragraph format, on 
delayed short-answer performance in Experiment 1 or 2. Furthermore, 
Experiment 2 showed that both activity formats produced 
retrieval practice effects: Students performed better on a final test 
when the initial activities required retrieval (in the absence of the 
texts) rather than studying or elaborating on the material (in the 
presence of the texts). This advantage of retrieval practice occurred 
even though students in the retrieval conditions produced less 
material during the initial learning activities relative to students in 
the repeated study conditions. 

Second, retrieving in paragraph format produced greater long- 
term performance relative to restudying and rewriting the material 
in paragraph format. It is reasonable to wonder whether the locus 
of retrieval practice effects rests in the act of writing itself, rather 
than in the mental activity of retrieving and reconstructing knowl- 
edge. If this were the case, the repeated study—paragraph condition 
in Experiment 2 should have produced long-term performance 
similar to that produced by the retrieval practice–paragraph condition. 
Indeed, because students were able to re-experience the 
entire set of material in the repeated study condition, one might 
expect that condition to outperform the retrieval practice condition. 
However, the opposite result occurred in Experiment 2, confirming 
that the act of retrieving knowledge itself, rather than the act of 
writing, drives the benefits seen in retrieval-based learning 
activities. 

Third, students generally believed they had learned more after 
repeatedly studying than after practicing retrieval. This result is 
consistent with a wealth of prior research (see Karpicke, 2012) and 
is also broadly consistent with a cue utilization approach to meta- 
cognitive judgments (e.g., Koriat, 1997). According to this view, 
students base their judgments of learning in part on the ease of 
processing they experience during a learning activity. When stu- 
dents complete activities with the text in front of them, processing 
is fluent and easy, whereas when students complete activities 
without the text, they base their judgments on the ease or difficulty 
with which the material can be brought back to mind during 
retrieval. Thus, repeated study activities tend to afford overconfi- 
dent judgments of learning, whereas retrieval practice leads to 
underconfident judgments. In Experiment 2, students rated concept 
mapping as more interesting and enjoyable than studying by copy- 
ing the text in paragraph form, but students’ ratings did not differ 
among concept map and paragraph formats when completed as 
retrieval activities. Despite some speculation that concept mapping 
might somehow promote or improve metacognitive performance 
(e.g., Mintzes et al., 1997), the present experiments offer no 
evidence that this is true (see too Karpicke & Blunt, 2011). 

The key finding from the present experiments was that retrieval 
practice was equally effective when done in concept map or 
paragraph format. Students did not gain additional benefits by 
retrieving knowledge in concept map format relative to retrieving 
in paragraph format. Concept mapping is assumed to promote 
organizational or relational processing that should improve learn- 
ing, but our results are consistent with the possibility that such 
organizational processing may be redundant with the processing 
people already engage in when practicing retrieval in other ways. 
Furthermore, practicing retrieval in concept map format did not 
impair learning relative to practicing retrieval in paragraph format. 
This finding suggests that the concept mapping task did not intro- 
duce extra cognitive load or divide attention in ways that were 
detrimental to learning. When students retrieved in concept map 
format, they tended to recall fewer ideas than when they retrieved 
in paragraph format, because they selectively reported ideas that 
were rated as most important. However, this was not detrimental to 
long-term learning either. Thus, the present experiments support 
the conclusion that concept mapping can indeed function as an 
effective learning activity when it involves practicing retrieval. 


Conclusion 


Retrieval practice is a powerful way to enhance long-term 
meaningful learning of educationally relevant content. The present 
results show that practicing retrieval, either by creating concept 
maps or by writing down the material in paragraph format, en- 
hanced long-term learning more than completing the same tasks as 
study activities. The locus of these learning effects was in the act 
of retrieving knowledge, rather than the mere act of writing down 
the material in paragraph or concept map format. It is important to 
note that the results show that concept mapping can indeed serve 
as an effective task when it is implemented as a retrieval-based 
learning activity. The key element for promoting meaningful learn- 
ing was not the format of the activity; it was the requirement to 
engage in active retrieval practice during learning. 


Appendix 


Examples of Verbatim and Inference Questions Used in Experiment 1 and Experiment 2 


Experiment 1. Sample questions from text on “Make-Up 
of Human Blood”: 


Verbatim question: 
“What happens when hemoglobin combines with oxygen?” 
(Sample answer: Oxygen is released to cells in the body.) 


Inference question: 

“What would happen to blood flow from a wound if the body did 
not have fibrin?” 
(Sample answer: Blood would not clot, because fibrin is needed to 
form a meshwork of fibers that trap blood cells and aid in clotting.) 


Experiment 2. Sample questions from text on 
“Enzymes”: 


Verbatim question: 
“What are two forms of free energy?” 
(Sample answer: Heat and kinetic energy.) 


Inference question: 

“What happens to catalytic activity if temperature decreases?” 
(Sample answer: Catalytic activity decreases because increasing 
temperature increases the rate of molecular collision, which leads 
to a faster reaction time.) 


For a complete set of questions, see Karpicke & Blunt (2011). 


Can Parents’ Involvement in Children’s Education Offset the Effects of 
Early Insensitivity on Academic Functioning? 

Data from the National Institute of Child Health and Human Development Study of Early Child Care and 
Youth Development (N = 1,312) were analyzed to examine whether the adverse effects of early 
insensitive parenting on children’s academic functioning can be offset by parents’ later involvement in 
children’s education. Observations of mothers’ early insensitivity (i.e., 6-54 months) interacted with 
teachers’ reports of parents’ later involvement (i.e., 1st-5th grade) in predicting children’s academic 
functioning as reflected in observed classroom engagement and performance on standardized achieve- 
ment tests at the end of elementary school (i.e., 5th grade): Although mothers’ insensitivity foreshadowed 
dampened academic functioning among children when parents’ involvement was relatively low, it did not 
do so when parents’ involvement was average or higher. 


Keywords: academics, achievement, parent involvement, parent sensitivity, parenting 


Insensitive parenting (i.e., unresponsiveness, hostility, and in- 
trusiveness) early in children’s lives appears to undermine chil- 
dren’s engagement and achievement in the academic arena not 
only when children first enter school but also throughout child- 
hood, adolescence, and even into adulthood (e.g., Fraley, Roisman, 
& Haltigan, 2013; Raby, Roisman, Fraley, & Simpson, 2013; 
Stams, Juffer, & van IJzendoorn, 2002). A key question is whether 
aspects of children’s later environment can offset these costs of 
early insensitivity. There is much evidence suggesting that parents’ 
involvement in children’s education (e.g., volunteering at school, 
attending parent-teacher conferences, and discussing school with 
children) is instrumental in promoting children’s academic func- 
tioning (for a review, see Pomerantz, Moorman, & Cheung, 2012). 
Hence, such involvement may be an important aspect of children’s 
later environment that can offset the academic problems associated 
with early insensitive parenting. The goal of the current research 
was to evaluate whether parents’ involvement in children’s edu- 
cation can compensate for the adverse effects of early insensitive 
parenting on children’s academic functioning. 


Parents’ Early Insensitivity 


Insensitive parenting has been conceptualized by the National 
Institute of Child Health and Human Development [NICHD] Early 
Child Care Research Network ([ECCRN] 1997, 2004, 2008) and 
others (e.g., Campbell, Matestic, von Stauffenberg, Mohan, & 
Kirchner, 2007; Stams et al., 2002) as parents’ unresponsiveness to 
children’s nondistress signals, hostility (vs. warmth) toward chil- 
dren, and intrusiveness (vs. autonomy support). Parents’ insensi- 
tivity has been argued to undermine children’s academic function- 
ing through several mechanisms. For example, when parents fail to 
respond contingently to children and are intrusive, children come 
to feel helpless in affecting their environment (e.g., Nolen- 
Hoeksema, Wolfson, Mumme, & Guskin, 1995; Riksen-Walraven, 
1978). Thus, children disengage when confronted with challenge 
in that they are less attentive, self-directed, and persistent, which 
may interfere with their learning (e.g., Bornstein & Tamis- 
LeMonda, 1997; Frodi, Bridges, & Grolnick, 1985; NICHD EC- 
CRN, 2008). The case has also been made that when parents are 
insensitive, children fail to develop a secure attachment (De Wolff 
& van IJzendoorn, 1997); as a consequence, they do not view their 
caregiver as a reliable source of support, which undermines their 
willingness to engage in potentially distressing but cognitively 
stimulating behaviors such as active exploration of the environ- 
ment and persistence in the face of challenge (e.g., Bretherton, 
1985; Main, 1983; Matas, Arend, & Sroufe, 1978; for additional 
mechanisms by which attachment contributes to children’s academic 
adjustment, see van IJzendoorn, Dijkstra, & Bus, 1995). 

Consistent with these ideas, exposure to early insensitivity ap- 
pears to disrupt the development of a foundation for children’s 
later academic functioning. For example, in the first year of life, 
children whose mothers are insensitive in that they are unrespon- 
sive or intrusive exhibit dampened attentiveness and persistence 
(e.g., Bornstein & Tamis-LeMonda, 1997; Frodi, Bridges, & Grol- 
nick, 1985). Moreover, early insensitive parenting predicts poorer 
cognitive skills during the preschool years, even after taking into 
account children’s earlier cognitive skills (Lemelin, Tarabulsy, & 
Provost, 2006). Recent evidence suggests that such effects are not 
simply due to shared genetics (Roisman & Fraley, 2012). More- 
over, the deleterious effects of early insensitive parenting on 
children’s academic functioning appear to endure into adoles- 
cence. Using the NICHD Study of Early Child Care and Youth 
Development (SECCYD), Fraley and colleagues (2013) demonstrated 
that the predictive significance of mothers’ early (i.e., 6-36 
months) insensitivity on children’s achievement remained rela- 
tively constant throughout childhood and into adolescence, even 
after accounting for the stability of mothers’ insensitivity across 
these years of development as well as potential confounds such as 
mothers’ educational attainment, children’s race, and family in- 
come. Similar findings were recently observed in the Minnesota 
Longitudinal Study of Risk and Adaptation through age 32 with a 
focus on academic attainment (Raby et al., 2013). 


The Compensatory Role of Parents’ Involvement in 
Children’s Education 


By becoming involved in children’s education, parents can 
offset the adverse effects of early insensitive parenting on chil- 
dren’s academic functioning. Parents’ involvement in children’s 
education has been argued to foster the psychological resources 
necessary for children’s optimal academic functioning (for re- 
views, see Pomerantz & Moorman, 2010; Pomerantz, Moorman, & 
Cheung, 2012). For example, such involvement may highlight the 
value of learning to children, which may heighten their engage- 
ment in school, thereby enhancing their achievement (e.g., Epstein, 
1988; Grolnick & Slowiaczek, 1994). The case has also been made 
that by providing additional instruction or opportunities for prac- 
tice, parents’ involvement develops important academic compe- 
tencies among children (e.g., Cheung & Pomerantz, 2011; Se- 
nechal & LeFevre, 2002). Parents’ involvement on the school front 
may lead to enhanced learning among children by increasing the 
attention children receive from teachers (Epstein & Becker, 1982; 
for additional mechanisms by which involvement contributes to 
children’s academic adjustment, see Pomerantz et al., 2012). The 
benefits of parents’ involvement are evident even when parents are 
insensitive as reflected in intrusiveness (Cheung & Pomerantz, 
2011). 

Because children exposed to insensitive parenting early in their 
lives are academically at risk—often lacking critical competencies 
for achievement in this context—they may be in much need of the 
resources provided by parents’ involvement in children’s educa- 
tion. Thus, they may be particularly likely to benefit from parents’ 
involvement, such that over time parents’ involvement can com- 
pensate for the costs of early insensitivity. Suggestive of this 
possibility, parents’ involvement can offset another aspect of chil- 
dren’s environment that appears to undermine children’s academic 
functioning. Dearing and colleagues (2006) found that achieve- 
ment disparities between children of less versus highly educated 
mothers were moderated by parents’ involvement in children’s 
education (see also Dearing, McCartney, Weiss, Kreider, & Simp- 
kins, 2004). Specifically, during the early elementary school years, 
when parents’ school-based involvement was low, children with 
less educated mothers had poorer literacy achievement than did 
children with more educated mothers, but when such involvement 
was high, this difference was not evident. Parents’ involvement 
may play a similar compensatory role when it comes to early 
insensitive parenting: When parents are involved in children’s 
education, the academic problems associated with early insensitive 
parenting may be reduced. 

Although many insensitive parents are not involved in chil- 
dren’s education, a sizable number are (for associations between 
insensitivity—as manifest in intrusiveness and hostility—and in- 
volvement, see Cheung & Pomerantz, 2011; Pomerantz, Wang, & 
Ng, 2005; Steinberg, Lamborn, Dornbusch, & Darling, 1992). 
Sensitive parenting may require a somewhat different set of skills 
and values than does involvement in children’s education. For 
example, parents’ sensitivity entails the capacity to show warmth 
as well as a concern with children’s psychological needs, whereas 
parents’ involvement entails the capacity to monitor children’s 
progress in school as well as a concern with children’s perfor- 
mance. Once children enter the formal school system, teachers 
may elicit parents’ involvement on the school front (e.g., commu- 
nicating with teachers or volunteering at school) via invitations 
(e.g., Green, Walker, Hoover-Dempsey, & Sandler, 2007). The 
benefits of school-based involvement may not be appreciably 
dampened by insensitivity because such involvement requires rel- 
atively little interaction between children and parents, but can still 
provide important resources—for example, by conveying that 
school is valuable. Indeed, the positive effects of school-based 
involvement are more consistent than those of home-based in- 
volvement (e.g., assisting children with homework and discussing 
school with children), which almost always entails interaction 
between children and parents (Pomerantz, Moorman, & Litwack, 
2007). Nonetheless, because home-based involvement also pro- 
vides important resources (e.g., instruction and practice), both 
home- and school-based involvement may play a compensatory 
role. 


Overview of the Current Research 


The key hypothesis guiding this research was that the adverse 
effects of early insensitive parenting on children’s later academic 
functioning can be offset by parents’ involvement in children’s 
education. This notion reflects one of the central tenets of Bron- 
fenbrenner’s (1992) ecological systems theory: The influence of 
children’s proximal environment—that is, the microsystem as 
manifest in parents’ early sensitivity—is shaped in part by the 
broader environment—in this case, the mesosystem as manifest in 
parents’ involvement in children’s education given that it includes 
interactions with school personnel. We focused on parents’ in- 
volvement during the elementary school years (i.e., first to fifth 
grade) because there is often substantial opportunity for parents to 
become involved on the school front at this time. Parents’ involve- 
ment was assessed with teacher reports explicitly referencing par- 
ents’ involvement on the school front given that such involvement 
is not only particularly reflective of the mesosystem but also may 
be minimally influenced by parental insensitivity (Pomerantz et 
al., 2007). The measure also asked about parents’ involvement in 
children’s education in general, which may capture parents’ in- 
volvement on the home, as well as school, front. 

We analyzed data from the NICHD SECCYD. Observations of 
mothers’ insensitive parenting prior to children’s entry into the 
formal education system (i.e., 6-54 months) were used as measures 
of early insensitivity. Once children entered first grade, 
teachers reported every year on parents’ involvement in children’s 
education. The annual reporting allowed us to examine parents’ 
involvement over an extended period of time (i.e., 5 years), which 
is of import given that years of insensitive parenting prior to 
children’s entry into school are unlikely to be immediately undone 
by a short phase (e.g., 1 year during first grade) of involvement. 
The effects of these two aspects of children’s environment on 
children’s academic functioning at the end of elementary school 
(i.e., fifth grade) were examined, adjusting for such functioning as 
children entered elementary school (i.e., first grade). This allowed 
us to rule out the possibility that the effects of parents’ involve- 
ment were due simply to developmental processes in place before 
children entered elementary school. To identify the breadth of the 
compensatory role of parents’ involvement, we investigated mul- 
tiple forms of children’s academic functioning, which were as- 
sessed using diverse methods (i.e., observations of engagement in 
the classroom, standardized achievement test performance, and 
teachers’ reports of academic competencies). 

Two major steps were taken to rule out alternative explanations. 
First, it is possible that early insensitive parenting followed by 
parents’ involvement in children’s education represents a change 
in insensitivity among parents such that by the elementary school 
years, parents have become less insensitive. Thus, we adjusted for 
mothers’ insensitivity during elementary school. We also evalu- 
ated whether it was insensitivity at this time rather than involve- 
ment that offsets the adverse effects of early insensitivity. Second, 
because parents’ involvement in children’s education can compen- 
sate for low educational attainment among mothers (e.g., Dearing 
et al., 2006), and insensitive parenting is particularly common 
when socioeconomic resources are low (e.g., Linver, Brooks- 
Gunn, & Kohen, 2002; NICHD ECCRN, 2005), we examined 
whether the interactive effects of early insensitivity and elemen- 
tary school involvement are unique or simply reflect the interactive 
effects of educational attainment and involvement. 


Method 


Participants 


Families were recruited for the NICHD SECCYD from hospi- 
tals in 10 locations (Little Rock, AR; Irvine, CA; Lawrence, KS; 
Boston, MA; Philadelphia, PA; Pittsburgh, PA; Charlottesville, 
VA; Morganton, NC; Seattle, WA; Madison, WI) shortly after 
mothers gave birth (for sampling and recruitment details, see 
http://secc.rti.org). The resulting sample consisted of 1,364 chil- 
dren and their mothers. The current analyses used assessments 
from Phases 1 (birth to 3 years), 2 (54 months to first grade), and 
3 (second to sixth grade). The analytic sample was restricted to 
dyads with data available for at least one of the assessments of 
maternal insensitivity or parental involvement examined in the 
current analyses (N = 1,312). This sample was predominantly 
(81%) White (13% African American, 2% Asian, and 5% other 
minority groups); 52% of children were boys. At Phase 1, there 
was a range of educational attainment among mothers: 30% had a 
high school degree or less, 55% had completed some college or 
earned a college degree, and 15% had completed some graduate 
work or earned a graduate degree. Most (84%) families had 
income-to-needs ratios classified as not poor (i.e., ≥ 1). 


Measures 


Table 1 presents descriptive statistics for the measures used in 
the current report. 

Maternal insensitivity. Early maternal insensitivity was assessed 
in the context of mother-child interactions during semistructured 
play. Mothers were provided age-appropriate toys designed 
to elicit joint play and instructed to play with children using 
the toys in a specific order, or to use the toys to accomplish a 
specific task (e.g., completing a maze on an etch-a-sketch toy). 
The interactions were videotaped in the home at 6 and 15 months 
and in the laboratory at 24, 36, and 54 months. At the 6-, 15-, and 
24-month observations, mothers’ behavior was coded for sensitiv- 
ity to nondistress (i.e., responsiveness to child’s signals), positive 
regard (i.e., expressions conveying positive feelings toward child), 
and intrusiveness (i.e., controlling behaviors) using a 4-point scale 
(from 1 = not at all characteristic of the interaction to 4 = highly 
characteristic of the interaction). Maternal insensitivity was com- 
puted as the sum of the three ratings, with sensitivity to nondistress 
and positive regard reverse scored (αs = .70-.79). At the 36- and 
54-month observations, mothers’ behavior was coded for support- 
ive presence (i.e., positive assistance and emotional support), 
hostility (i.e., angry or rejecting behaviors), and respect for autonomy 
(i.e., acknowledgement and support of child’s intentions and 
independence) using a 7-point scale (from 1 = not at all characteristic 
of the interaction to 7 = highly characteristic of the 
interaction). Maternal insensitivity was computed as the sum of 
the three scales, with supportive presence and respect for autonomy 
reverse scored (αs = .78-.84; interrater reliability: rs = 
.71-.87, ps < .001). The associations among maternal insensitivity 
across the five measurement points were sizable (rs = .30-.52, 
ps < .001); together, the five formed a reliable composite score 
(α = .76). Thus, a single index was created by taking the mean of 
mothers’ standardized insensitivity scores prior to children’s formal 
schooling (i.e., at 6, 15, 24, 36, and 54 months), with higher 
numbers reflecting heightened maternal insensitivity. 


Table 1 
Descriptive Statistics 

Measure                                        M (SD)            Observed range 

Early maternal insensitivity 
  6 months                                     9.21 (1.78)       3-12 
  15 months                                    9.40 (1.65)       3-12 
  24 months                                    9.35 (1.76)       3-12 
  36 months                                    17.19 (2.78)      4-21 
  54 months                                    16.95 (2.91)      4-21 
Elementary school maternal insensitivity 
  1st grade                                    16.88 (3.03)      5-21 
  3rd grade                                    16.34 (2.49)      4-21 
  5th grade                                    16.50 (2.42)      7-21 
Elementary school parent involvement 
  1st grade                                    3.95 (.98)        1-5 
  2nd grade                                    3.55 (.96)        1-5 
  3rd grade                                    3.33 (.94)        1-5 
  4th grade                                    3.30 (.92)        1-5 
  5th grade                                    3.25 (.94)        1-5 
Elementary school classroom engagement 
  1st grade                                    55.91 (4.73)      28-60 
  5th grade                                    40.92 (8.43)      11.25-59.00 
Elementary school Woodcock-Johnson achievement tests 
  1st grade                                    477.15 (10.82)    432.00-509.71 
  5th grade                                    508.46 (11.66)    417.50-540.60 
Elementary school academic skills 
  1st grade                                    3.28 (.90)        1-5 
  5th grade                                    3.46 (.85)        1-5 
Elementary school current performance 
  1st grade                                    3.41 (.84)        1-5 
  5th grade                                    3.48 (.96)        1-5 
Covariates 
  Maternal education                           14.28 (2.50)      7-21 
  Income-to-needs ratio                        3.60 (2.85)       0.15-27.36 
  Maternal depression                          9.36 (6.76)       0-43 

Elementary school maternal insensitivity was assessed when 
children were in first, third, and fifth grade during videotaped 
sessions in the laboratory. In first grade, insensitivity was assessed 
in the context of a semistructured mother-child play task. In third 
and fifth grades, insensitivity was assessed in the context of 
mothers and children discussing areas of disagreement and work- 
ing on a planning task. Mothers’ behavior was coded for support- 
ive presence, hostility, and respect for autonomy (interrater reli- 
ability: rs = .72-.83). At each time point, the sum of the three 
scales was taken, with supportive presence and respect for autonomy 
reverse scored (αs = .80-.85). A single index was created by 
taking the mean of mothers’ standardized insensitivity scores at the 
first, third, and fifth grades, which were sizably associated with 
one another (rs = .43-.47, ps < .001; α = .71). 
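
The scoring steps described above (reverse-score the two sensitivity ratings, sum the three ratings, standardize within each wave, then average across waves) can be sketched as follows. This is only an illustration under stated assumptions, not the study's scoring code: the function names and example ratings are hypothetical, reverse scoring is assumed to be `scale_max + 1 - rating`, and the population standard deviation is used for standardizing.

```python
from statistics import mean, pstdev

def insensitivity_sum(reverse_scored, direct, scale_max):
    """Sum the ratings after flipping the items that indicate sensitivity.

    Assumes ratings run from 1 to scale_max, so a flipped rating is
    scale_max + 1 - rating (an assumption; the text does not spell it out).
    """
    flipped = [scale_max + 1 - r for r in reverse_scored]
    return sum(flipped) + sum(direct)

def standardize(scores):
    """Convert raw scores to z scores within one measurement wave."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

def composite(waves):
    """Average each mother's z scores across waves (one list per wave)."""
    z_by_wave = [standardize(w) for w in waves]
    return [mean(zs) for zs in zip(*z_by_wave)]

# Hypothetical 7-point ratings for one mother:
# supportive presence = 6, respect for autonomy = 5, hostility = 2.
score = insensitivity_sum(reverse_scored=[6, 5], direct=[2], scale_max=7)

# Hypothetical summed scores for three mothers at two waves.
index = composite([[5, 9, 13], [6, 10, 14]])
```

With three ratings on a 7-point scale the sum can range from 3 to 21, and on a 4-point scale from 3 to 12, which brackets the observed ranges listed in Table 1.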

Parental involvement. Teachers completed the Parent- 
Teacher Involvement Questionnaire (Kohl, Lengua, McMahon, & 
Conduct Problems Prevention Research Group, 2000; Miller- 
Johnson, Maumary-Gremaud, & Conduct Disorders Research 
Group, 1995) each year from the time children were in first 
through fifth grade. Four of the 10 items comprising the question- 
naire ask about parents’ involvement in children’s education ex- 
plicitly on the school front (e.g., “How often does this parent 
volunteer or visit at school?” and “How often does this parent send 
things to class like story books or objects?”); four ask about 
parents’ involvement more generally (e.g., “How involved is this 
parent in his/her child’s education and school life?” and “How 
important is education in this family?”). Two items do not directly 
assess parental involvement, but rather the relationship between 
parents and teachers (“How well do you feel you can talk to and be 
heard by this parent?” and “If you had a problem with this child, 
how comfortable would you feel talking to his/her parent about 
it?”); thus, they were omitted for the current analyses. Teachers 
rated items on a 5-point scale (from 1 = not at all to 5 = a great 
deal). Scores were computed by taking the mean of the eight items 
assessing parental involvement (αs = .91-.92), with higher scores 
indicating heightened involvement. A single index was created by 
taking the mean of the scores from first through fifth grades, which 
were sizably associated (rs = .54-.64, ps < .001) and together 
formed a reliable composite (α = .88). 
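
The internal-consistency coefficients (α) reported for these scales are presumably Cronbach's alpha. As a hedged illustration of how such a coefficient is computed, the sketch below applies the standard formula to made-up teacher ratings; the function name and the data are hypothetical and are not taken from the study.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of items, each a list of scores
    from the same respondents in the same order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))

# Hypothetical teacher ratings (1-5) on three items for four parents.
ratings = [
    [4, 2, 5, 3],
    [4, 1, 5, 3],
    [5, 2, 4, 3],
]
alpha = cronbach_alpha(ratings)
```

Alpha approaches 1 as the items rise and fall together across respondents, which is why the highly consistent involvement items yield values above .90.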

Child academic functioning. Multiple forms of children’s 
academic functioning were examined. We used assessments at the 
beginning (i.e., first grade) and end (i.e., fifth grade) of elementary 
school. Observations of classroom engagement were made using 
the Classroom Observation System, a rating system created specifically 
for the NICHD SECCYD. Observers recorded the frequency 
of children’s behaviors in 10-min periods consisting of 
30-s observe, 30-s record intervals. In first grade, children’s engagement 
was computed as the sum of the active and passive 
engagement scales (interrater reliability: r = .82, p < .001), which 
were based on six observe-and-record 10-min periods; in fifth 
grade, children’s engagement was computed as the sum of the 
engaged in learning and highly engaged scales, which were based 
on eight to 11 observe-and-record 10-min periods (interrater reli- 
ability: r = .97, p < .001). Higher scores indicate heightened 
classroom engagement. 

Children’s achievement was assessed with the Woodcock- 
Johnson Tests of Psychoeducational Achievement-Revised (Wood- 
cock & Johnson, 1989). In first grade, children completed four 
subtests of cognitive aptitude: Memory for Names (long-term 
retrieval), Memory for Sentences (short-term memory), Incom- 
plete Words (auditory processing), and Picture Vocabulary (verbal 
comprehension). Children also completed three subtests of 
achievement: Letter-Word Identification (learning and reading), 
Applied Problems (mathematical and practical problem solving), 
and Word Attack (phonic and structural analysis). In fifth grade, 
children completed one subtest of cognitive ability (i.e., Picture 
Vocabulary) and three subtests of achievement (i.e., Letter-Word 
Identification, Applied Problems, and Passage Comprehension). 
Raw scores for each test were converted to W scores, a transfor- 
mation of the Rasch ability scale centered at the value of 500. The 
mean of the standardized W scores on the subtests was taken at the 
first (rs = .25-.85, ps < .001) and fifth (rs = .44-.74, ps < .001) 
grades, with higher numbers indicating higher achievement. 

Teacher reports of academic competencies when children were 
in the first and fifth grades were used. Teachers completed the 
Academic Skills questionnaire from the Early Childhood Longi- 
tudinal Study (Nicholson, Atkins-Burnett, & Meisels, n.d.). In first 
grade, teachers rated children’s ability to perform 25 age-appropriate 
skills on a 5-point scale (from 1 = not yet to 5 = 
proficient; α = .97). Fifteen items captured language and literacy 
skills (e.g., “Reads first grade books fluently”), and 10 captured 
mathematical thinking skills (e.g., “Understands place values”). In 
fifth grade, teachers reported on a comparable set of 23 skills (α = 
.95). Ten items captured language and literacy skills (e.g., “Conveys 
ideas clearly”) and 13 captured mathematical thinking skills 
(e.g., “Uses a variety of strategies solving math problems”). Scores 
were computed as the mean of the items, with higher scores 
indicating higher proficiency. Teachers also completed the Current 
School Performance subscale of the Child Evaluation question- 
naire (Pierce, Hamm, & Vandell, 1999) when children were in the 
first and fifth grades. Teachers rated children’s performance in 
reading, oral language, written language, math, social studies, and 
science on a 5-point scale (from 1 = below grade level to 5 = 
excellent; αs = .93 and .95). At both the first and fifth grades, the 
Academic Skills and Current School Performance scales were 
substantially associated (rs = .77 and .78, ps < .001); thus, a 
composite index was created by taking the mean of the two, with 
higher numbers reflecting greater competence. 

Covariates. Mothers reported their educational attainment 
(i.e., years of education), child gender, and child race (i.e., Amer- 
ican Indian, Asian, Black, White, or “other”) when children were 
1 month old. Ongoing assessments of family income-to-needs ratio 
and maternal depression were completed throughout the study. To 
compute income-to-needs ratios, family total income was divided 
by the poverty threshold based on family size. Maternal depression 
was measured with the 20-item Center for Epidemiology Depression 
Scale (Radloff, 1977). To capture the same time frame as the 
assessments of mothers’ early sensitivity, we averaged over the 
assessments when children were 6, 15, 24, 36, and 54 months old 
for family income-to-needs ratio and maternal depression (αs = 
.89-.91). 
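
The income-to-needs computation described above amounts to a ratio per wave followed by an average across the five early waves. A minimal sketch, with entirely hypothetical dollar amounts and a hypothetical function name, and assuming that a ratio of at least 1 marks a family as not poor:

```python
from statistics import mean

def income_to_needs(income, poverty_threshold):
    """Family income divided by the poverty threshold for that family size."""
    return income / poverty_threshold

# Hypothetical (income, threshold) pairs at 6, 15, 24, 36, and 54 months.
waves = [(30000, 15000), (32000, 16000), (28000, 16000),
         (36000, 16000), (40000, 20000)]
ratios = [income_to_needs(income, threshold) for income, threshold in waves]

early_ratio = mean(ratios)      # averaged over the early assessments
not_poor = early_ratio >= 1     # assumed cutoff for "not poor"
```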


Results 


Missing Values 


Due to failure to complete all assessments and item nonre- 
sponse, the NICHD SECCYD contains incomplete data. Analyses 
were conducted not only on the original data set with listwise 
deletion but also on 25 imputed data sets created with multiple 
imputation (Rubin, 1987; Schafer & Graham, 2002) using the fully 
conditional specification method in SPSS 19.0 (IBM Corp., 2010). 
The imputation model included all of the variables in the central 
and supplementary analyses, as well as the interaction terms as 
suggested by Enders (2010). Results from the imputed data sets 
were combined, yielding pooled estimates that account for vari- 
ability in estimates between and within the data sets. The pattern 
of results was practically identical for the estimates based on the 
original and imputed data sets (see Table 3). Results from the 
original data set are reported in the text (see also Tables 1 and 2 
and Figures 1 and 2), with the results from both the original and 
imputed data sets presented in Table 3. 
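
Pooling that accounts for variability both within and between imputed data sets is conventionally done with Rubin's (1987) rules. The sketch below shows those rules for a single coefficient; it assumes the pooling followed this convention, and the function name and numbers are hypothetical rather than taken from the analyses.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets with Rubin's rules.

    estimates: the coefficient estimated in each imputed data set
    variances: its squared standard error in each data set
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = mean(estimates)           # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5

# Hypothetical estimates and squared standard errors from m = 3 data sets.
estimate, standard_error = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The `(1 + 1 / m)` factor inflates the between-imputation variance to reflect the finite number of imputations, so the correction shrinks as more data sets are imputed (25 in the analyses reported here).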


Preliminary Analyses 


As shown in Table 2, mothers’ insensitivity, both early in 
children’s lives (i.e., 6-54 months) and during elementary school 
(i.e., first, third, and fifth grade), was inversely associated with 
parents’ involvement in children’s education during elementary 
school (i.e., first to fifth grade). The associations were sizable 
(rs = -.49 and -.43, ps < .001), but far from unity. Indeed, 17% 
(n = 187) of the sample of 1,109 with relevant data were above the 
50th percentile on both early insensitivity and elementary school 
involvement, 33% (n = 369) were above the 50th percentile on 
early insensitivity and below the 50th percentile on elementary 
school involvement, 34% (n = 377) were below the 50th percen- 
tile on early insensitivity and above the 50th percentile on elemen- 
tary school involvement, and 16% (n = 176) were below the 50th 
percentile on both early insensitivity and elementary school in- 
volvement. 
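
As a quick arithmetic check, the four quadrant counts reported above can be converted back to percentages. The counts come from the text; the labels are shorthand introduced here.

```python
# Counts reported in the text for the four median-split quadrants.
counts = {
    "high insensitivity, high involvement": 187,
    "high insensitivity, low involvement": 369,
    "low insensitivity, high involvement": 377,
    "low insensitivity, low involvement": 176,
}
total = sum(counts.values())
percents = {label: round(100 * n / total) for label, n in counts.items()}
```

The counts sum to the 1,109 families with relevant data, and the rounded percentages reproduce the 17%, 33%, 34%, and 16% given in the text.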

Consistent with prior research, mothers’ early, as well as ele- 
mentary school, insensitivity was associated with children’s aca- 
demic functioning at both first and fifth grade (e.g., mothers’ early 
insensitivity was inversely associated with children’s standardized 
achievement test performance at fifth grade). Also consistent with 
prior research, parents’ elementary school involvement was asso- 
ciated with heightened academic functioning among children at 
both the beginning (i.e., first grade) and end (i.e., fifth grade) of 
elementary school. The covariates (e.g., maternal education and 
depression) were generally associated with parenting and chil- 
dren’s academic functioning, indicating the importance of taking 
them into account. 


Predicting Children’s Fifth-Grade Academic Functioning From Mothers’ Early Sensitivity and Parents’ Elementary School Involvement 

For child gender, 1 = male and 2 = female; for child race, 1 = White and 2 = non-White. 


Figure 1. Parents’ elementary school involvement moderates the effects 
of mothers’ early insensitivity on observations of children’s classroom 
engagement at fifth grade, controlling for engagement at first grade. Slopes 
were estimated from the regression equation from the original data set. 
Low insensitivity and involvement reflects estimates at the 25th percentile; 
average reflects estimates at the 50th percentile; high reflects estimates at 
the 75th percentile. 


Figure 2. Parents’ elementary school involvement moderates the effects 
of mothers’ early insensitivity on children’s scores on the Woodcock-Johnson 
standardized achievement tests at fifth grade, controlling for 
scores at first grade. Slopes were estimated from the regression equation 
from the original data set. Low insensitivity and involvement reflects 
estimates at the 25th percentile; average reflects estimates at the 50th 
percentile; high reflects estimates at the 75th percentile. 


Central Analyses 


To evaluate whether the adverse effects of early insensitivity on 
children’s later academic functioning can be offset by later involvement 
in children’s education, hierarchical multiple regression 
was used. Each form of academic functioning (i.e., observations of 
classroom engagement, standardized achievement test scores, and 
teacher reports of academic competencies) at the end of elementary 
school (i.e., fifth grade) was predicted from early insensitivity 
(i.e., 6-54 months) and elementary school involvement (i.e., first 
to fifth grades). To take into account academic functioning in the 
initial school years, academic functioning at first grade was entered 
as a covariate in the first step. This step also included the 
other covariates: child gender (1 = male, 2 = female) and race 
(1 = White, 2 = non-White), maternal education (i.e., years of 
education completed) and depression, and the family income-to-needs 
ratio. In the second step, early insensitive parenting was 
entered. Elementary school involvement and insensitivity were 
added in the third step. This was followed by a fourth step 
including the target Early Insensitivity × Elementary School 
Involvement interaction. As suggested by Aiken and West (1991), to 
reduce multicollinearity, the continuous variables were mean-centered 
by standardizing them. 
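
The core of the stepwise model described above, standardized predictors followed by a step that adds their product term, can be sketched as follows. This is a simplified illustration and not the authors' analysis: covariates are omitted, the data are a synthetic, noise-free grid generated from made-up coefficients, and `standardize` and `ols` are hypothetical helper names written with the standard library only.

```python
from statistics import mean, pstdev

def standardize(xs):
    """Mean-center and scale to unit variance (z scores)."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(k):
            if r != col:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][k] for i in range(k)]

# Hypothetical raw scores on a 4 x 4 grid (no real data are used here).
raw_x = [x for x in (1, 2, 3, 4) for _ in range(4)]   # early insensitivity
raw_z = [z for _ in range(4) for z in (1, 2, 3, 4)]   # involvement
x, z = standardize(raw_x), standardize(raw_z)

# Outcome built from known, made-up coefficients so the fit can be checked.
true_b = [0.50, -0.06, 0.10, 0.09]   # intercept, x, z, x * z
y = [true_b[0] + true_b[1] * xi + true_b[2] * zi + true_b[3] * xi * zi
     for xi, zi in zip(x, z)]

# Step with the two main effects only, then the step adding the product term.
main_only = ols([[1.0, xi, zi] for xi, zi in zip(x, z)], y)
with_interaction = ols([[1.0, xi, zi, xi * zi] for xi, zi in zip(x, z)], y)
```

Because the predictors are standardized before the product is formed, the product term in this balanced example is uncorrelated with the main effects, which is the multicollinearity benefit that Aiken and West (1991) describe.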

As shown in Table 3, early insensitivity predicted academic 
functioning at the end of elementary school, taking into account 
such functioning at the beginning of elementary school: The more 
insensitive mothers were early on, the poorer children’s subsequent 
engagement, standardized test scores, and academic competencies 
(ts ≥ 3.00, ps < .01). It was also the case that the more 
involved parents were in children’s education during elementary 
school, the better children’s engagement, standardized test scores, 
and academic competencies at the end of elementary school (ts ≥ 
2.12, ps < .05). As anticipated, the effects of early insensitivity 
were moderated by elementary school involvement, as indicated 
by Early Insensitivity × Elementary School Involvement interactions 
in predicting observations of engagement in the classroom 
(t = 2.90, p < .01) and standardized achievement test performance 
(t = 4.10, p < .001). The Early Insensitivity × Elementary School 
Involvement interaction, however, did not reach significance for 
teachers’ reports of academic competencies (t = 1.88, p = .06). 

The interactions for observed classroom engagement and stan- 
dardized achievement test performance were decomposed follow- 
ing Aiken and West’s (1991) guidelines. To evaluate the effects of 
early insensitivity when elementary school involvement was low, 
we conducted hierarchical multiple regression analyses identical to 
those described earlier, but centering involvement at the 25th 
percentile (i.e., a standardized score of —.67). The effects of early 
insensitivity when involvement was average or high were evalu- 
ated in regression analyses in which involvement was centered at 
the 50th (i.e., a standardized score of .16) and 75th percentile (i.e., 
a standardized score of .79), respectively. As shown in Figure 1, 
when elementary school involvement was low (i.e., the 25th per- 
centile), early insensitivity was predictive of dampened classroom 
engagement at the end of elementary school (β = -.12; t = 2.42,
p < .05); however, when elementary school involvement was
average (i.e., the 50th percentile) or high (i.e., the 75th percentile),
early insensitivity did not matter for children’s classroom engagement
(βs = -.04 and .02; ts < 1). Similarly, as shown in Figure 2,
early insensitivity predicted lower achievement scores over time
when elementary school involvement was low (β = -.07; t = 2.44,
p < .05), but not when it was average (β = -.01; t = 0.22,
p = .83) or high (β = .04; t = 1.24, p = .21).
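The recentering logic behind these simple-slope tests can be illustrated with a minimal sketch. It assumes a plain ordinary least squares model with one predictor, one moderator, and their product; the covariates in the reported models (e.g., earlier academic functioning) are omitted, the data are synthetic, and the helper names `ols` and `simple_slope` are hypothetical rather than taken from the study.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and their standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef, np.sqrt(np.diag(cov))

def simple_slope(x, m, y, m_value):
    """Slope of y on x when the moderator m equals m_value.

    Re-centering m at m_value makes the coefficient on x the
    conditional (simple) slope at that moderator value.
    """
    m_c = m - m_value
    X = np.column_stack([np.ones_like(x), x, m_c, x * m_c])
    coef, se = ols(X, y)
    return coef[1], coef[1] / se[1]  # slope and its t statistic

# Synthetic, illustrative data only (not the study's data).
rng = np.random.default_rng(0)
n = 1000
insensitivity = rng.standard_normal(n)
involvement = rng.standard_normal(n)
outcome = (-0.05 * insensitivity + 0.10 * involvement
           + 0.08 * insensitivity * involvement
           + rng.standard_normal(n))

for pct in (25, 50, 75):
    value = np.percentile(involvement, pct)
    slope, t = simple_slope(insensitivity, involvement, outcome, value)
    print(f"involvement at {pct}th percentile ({value:+.2f}): "
          f"slope = {slope:+.3f}, t = {t:+.2f}")
```

Recentering does not change the fitted model. The slope obtained at a given moderator value equals the uncentered main effect plus the interaction coefficient multiplied by that value, so refitting is simply a convenient way to obtain the matching standard error.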

To identify the extent of involvement necessary to offset the 
adverse effects of early insensitivity, regions of significance anal- 
yses were conducted (see Aiken & West, 1991; Preacher, Curran, 
& Bauer, 2006; Roisman et al., 2012). The regions of significance 
analyses allowed us to identify the specific value of involvement 
(i.e., the moderator) at which early insensitivity no longer predicted
dampened academic functioning. These analyses were conducted with the
web-based application (http://www.yourpersonality.net/interaction)
created by R. Chris Fraley as a supplement to
Roisman et al. (2012). The results indicated that the minimum 
value of parents’ involvement necessary to reduce the negative 
association between early insensitivity and observed classroom
engagement to nonsignificance was a standardized score of -.40,
which is slightly less than a half standard deviation below the
mean, representing the 33rd percentile of the sample. For standardized
achievement test performance, the minimum value was a
standardized score of -.47, which is almost a half standard deviation
below the mean, representing the 32nd percentile of the
sample. 
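A region-of-significance boundary of this kind is commonly found by solving for the moderator values at which the simple slope's t statistic equals the critical value (the Johnson-Neyman approach described by Aiken and West, 1991, and Preacher et al., 2006). The sketch below is an illustration under stated assumptions, not the web application's code: the coefficient and variance inputs are hypothetical, the function names are invented for the example, and a normal-approximation critical value of 1.96 is assumed for a large sample.

```python
import numpy as np

def slope_t(m, b1, b3, var_b1, var_b3, cov_b13):
    """t statistic of the simple slope b1 + b3*m at moderator value m."""
    slope = b1 + b3 * m
    se = np.sqrt(var_b1 + 2 * m * cov_b13 + m**2 * var_b3)
    return slope / se

def region_boundaries(b1, b3, var_b1, var_b3, cov_b13, t_crit=1.96):
    """Moderator values at which |t| of the simple slope equals t_crit.

    Squaring slope / se = t_crit gives a quadratic in m:
    (b3^2 - t^2 var_b3) m^2 + 2 (b1 b3 - t^2 cov_b13) m
        + (b1^2 - t^2 var_b1) = 0
    """
    a = b3**2 - t_crit**2 * var_b3
    b = 2 * (b1 * b3 - t_crit**2 * cov_b13)
    c = b1**2 - t_crit**2 * var_b1
    roots = np.roots([a, b, c])
    real = roots[np.isreal(roots)].real  # keep real-valued boundaries only
    return sorted(float(r) for r in real)

# Hypothetical estimates for illustration (not values from the study).
est = dict(b1=-0.05, b3=0.08, var_b1=0.0009, var_b3=0.0009, cov_b13=0.0)
for m in region_boundaries(**est):
    print(f"boundary at moderator = {m:+.3f}, t = {slope_t(m, **est):+.3f}")
```

With a standardized moderator, each boundary can be read directly in standard deviation units. Converting it to a sample percentile, as in the text above, requires the observed distribution of the moderator.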

To identify the proportion of children exposed to early insensi- 
tivity who were helped by parents’ involvement in children’s 
education, we identified the threshold of early insensitivity where 
parents’ involvement began to have a significant positive effect— 
this is at slightly above the 50th percentile of early insensitivity. 
We then looked at how many families had scores that fell at or 
above this threshold and at or above the threshold of involvement 
that reduces the negative insensitivity effect to nonsignificance. 
The proportion of families ranged from 16% (for the analyses 
predicting engagement) to 22% (for the analyses predicting 
achievement)—that is, 179 to 241 families of those with the 
relevant data (n = 1,109). 
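The reported percentages follow directly from the family counts. A quick arithmetic check, using only the figures given above (the helper name `share` is illustrative):

```python
def share(count, total):
    """Return count as a fraction of total."""
    return count / total

n_families = 1109  # families with the relevant data, as reported
for label, count in (("engagement", 179), ("achievement", 241)):
    print(f"{label}: {count}/{n_families} = {share(count, n_families):.1%}")
# engagement: 179/1109 = 16.1%
# achievement: 241/1109 = 21.7%
```

Rounded to whole percentages, these values give the 16% and 22% reported in the text.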


Supplementary Analyses 


To ensure that the effects we identified reflect the role of 
parents’ involvement during elementary school rather than insen- 
sitive parenting during this time, we conducted hierarchical mul- 
tiple regressions identical to those in the central analyses but 
replacing the Early Insensitivity × Elementary School Involvement
interaction with the Early Insensitivity × Elementary School
Insensitivity interaction. Although this interaction was not evident
for observations of classroom engagement (β = -.02; t = .65, p =
.52), it was evident for standardized achievement test performance
(β = -.08; t = 3.75, p < .001) and teachers’ reports of competencies
(β = -.06; t = 2.38, p < .05). To ensure that such an
interaction was not responsible for the Early Insensitivity × Elementary
School Involvement interaction for standardized achievement
test performance, the two interactions were entered simultaneously.
Both the Early Insensitivity × Elementary School
Involvement interaction (β = .06; t = 2.60, p < .01) and the Early
Insensitivity × Elementary School Insensitivity interaction
(β = -.05; t = 2.01, p < .05) remained. Thus, the compensatory
role of elementary school involvement does not appear to be 
attributable to dampened insensitivity during elementary school. 

Because prior research indicates that parents’ involvement in 
children’s education moderates the effect of parents’ educational 
attainment on children’s academic functioning, we wanted to en- 
sure that elementary school involvement moderated the effects of 
early insensitivity over and above maternal educational attainment. 
Thus, regression analyses identical to those used in the central 
analyses were conducted replacing the Early Insensitivity × Elementary
School Involvement interaction with a Maternal Educational
Attainment × Elementary School Involvement interaction.
Consistent with prior research, this interaction was evident for
observed classroom engagement (β = -.07; t = 2.08, p < .05)
and standardized achievement test performance (β = -.08; t =
3.99, p < .001); however, it did not reach significance for teacher
reports of academic competencies (β = -.05; t = 1.80, p = .07).
To ensure that the Maternal Educational Attainment × Elementary
School Involvement interaction did not account for the Early
Insensitivity × Elementary School Involvement interaction, both
were simultaneously entered in the analyses, predicting observations
of classroom engagement and standardized achievement test
performance. The Early Insensitivity × Elementary School Involvement
interaction remained for engagement (β = .09; t =
2.26, p < .05) and test scores (β = .06; t = 2.64, p < .01). The
Educational Attainment × Elementary School Involvement interaction
remained only for standardized achievement test scores
(β = -.05; t = 2.47, p < .05).


Discussion 


The current findings are consistent with the idea that parents’ 
involvement in children’s education can offset the adverse effects 
of early insensitive parenting on children’s academic functioning. 
When parents were relatively uninvolved in children’s education 
during elementary school (i.e., first to fifth grade), the legacy of 
early (i.e., 6-54 months) insensitive parenting was evident in 
deficits in children’s classroom engagement and performance on 
standardized achievement tests at the end of elementary school 
(i.e., fifth grade). However, such effects were no longer statisti- 
cally significant when parents showed average or higher involve- 
ment. Although parents’ involvement in children’s education is 
less common among parents with a history of insensitivity, exam- 
ination of its compensatory role is important because it provides a 
window into its potential power to overcome academic risk among 
children. 


The Compensatory Role of Parents’ Involvement in 
Children’s Education 


Replicating prior research indicating that early insensitive par- 
enting foreshadows academic problems among children (e.g., Fra- 
ley et al., 2013; Lemelin et al., 2006; NICHD ECCRN, 2008; 
Stams et al., 2002), the current research revealed that the more 
insensitive mothers were early in children’s lives (i.e., 6-54 
months), the poorer children’s academic functioning at the end of 
elementary school (i.e., fifth grade) over and above their earlier 
(i.e., first grade) academic functioning. However, this was mod- 
erated by parents’ involvement in children’s education during 
elementary school (i.e., first to fifth grades) such that mothers’ 
early insensitivity predicted dampened subsequent academic func- 
tioning (i.e., fifth grade) among children only when parents were 
relatively uninvolved. Notably, it appeared that merely average (or 
better) involvement was necessary to offset the adverse effects of 
early insensitivity. Our analyses ruled out the possibility that the 
effects of parents’ involvement simply reflected a change in in- 
sensitive parenting such that involved parents with a history of 
insensitive parenting became sensitive over time. However, re- 
maining to be investigated is whether the compensatory role of 
parents’ involvement in children’s education begins earlier than 
the elementary school years: When insensitive parents are in- 
volved early in children’s lives (e.g., by reading to them or 
counting with them), does this protect children? Given that our 
analyses took children’s academic functioning at first grade into 
account, the compensatory role of such early involvement is likely 
distinct from that of parents’ involvement during elementary school.

The results of the current research parallel Dearing and col- 
leagues’ (Dearing, Kreider, Simpkins, & Weiss, 2006; Dearing, 
McCartney, Weiss, Kreider, & Simpkins, 2004) findings that parents’
involvement in children’s education reduces achievement
disparities between children of less versus highly educated moth- 
ers. Notably, the compensatory role of parents’ involvement for 
early insensitive parenting is unique to insensitivity in that the 
Early Insensitivity × Elementary School Involvement interaction
remained when adjusting for the Educational Attainment × Elementary
School Involvement interaction. Taken together with
Dearing and colleagues’ findings, the current research suggests 
that parents’ involvement may be powerful in its ability to offset 
multiple aspects of children’s environment that put them at risk 
academically. The findings are also in line with Bronfenbrenner’s 
(1992) ecological systems theory in underscoring that the influ- 
ence of the microsystem (i.e., insensitive parenting) may be mod- 
erated by the connections parents establish outside this system 
(i.e., the mesosystem) as reflected in their involvement in chil- 
dren’s education. 

A major strength of the current research is the use of multiple 
methods to assess parenting and children’s academic functioning. 
The two aspects of parenting examined were assessed using dif- 
ferent methods—observations of mothers’ insensitivity and 
teacher reports of parents’ involvement—with multiple assess- 
ments over the time frames of interest. In addition, we examined 
three forms of children’s academic functioning, each assessed with 
a different method (i.e., observations of classroom engagement, 
performance on standardized achievement tests, and teacher re- 
ports of academic competencies). Given that the compensatory 
effect of parents’ involvement was evident for both observations of 
children’s classroom engagement and children’s performance on 
standardized tests, the effect’s range of influence on academic 
functioning appears to be broad, with no evidence that it simply 
reflects reporter bias or shared method variance. In the case in 
which teachers reported on both parents’ involvement and chil- 
dren’s academic competencies, it may have been difficult to detect 
the target interaction because of the shared reporter bias. Indeed, 
parents’ involvement was a larger predictor of this form of aca- 
demic functioning than it was of observations of children’s en- 
gagement or standardized achievement tests—a pattern that was 
not evident for insensitive parenting (see Table 3). Moreover, 
when we examined only the items asking about parents’ involve- 
ment explicitly on the school front, the interaction predicting 
teachers’ reports of children’s academic competencies reached 
significance. The explicit nature of these items (vs. those that are 
more general) may minimize teachers’ biases in completing the 
parent involvement measure. 

At first blush, the finding that parents’ involvement in children’s 
education offsets the negative effects of early insensitive parenting 
may appear to contradict the argument made by several investiga- 
tors that although the quantity of parents’ involvement in chil- 
dren’s education is important, so is the quality (e.g., Pomerantz, 
Grolnick, & Price, 2005; Pomerantz et al., 2007). Much research 
indicates that parents’ sensitive involvement in children’s educa- 
tion (e.g., supporting children’s autonomy while helping them with 
an academic activity) is beneficial for children’s academic func- 
tioning (e.g., Grolnick, Gurland, DeCourcey, & Jacob, 2002; 
Hokoda & Fincham, 1995; Moorman & Pomerantz, 2008). Two 
studies reveal an interaction between parents’ involvement and 
what might be considered insensitive parenting opposite to the one 
identified here. Focusing on kindergarten students, Simpkins and 
colleagues (2006) found that parents’ school-based involvement 
was positively associated with children’s achievement only when
there was a warm parent-child relationship. Similarly, in a study 
with adolescents, parents’ involvement on the school and home 
fronts was more predictive of achievement, but not necessarily 
engagement, the more parents used an authoritative parenting style 
(Steinberg et al., 1992). 

These two studies examined parenting quality and parents’ 
involvement in children’s education at the same time; the current 
study, however, examined the two at different times—parenting 
quality in the years before children entered school and parents’ 
involvement when children were in elementary school. In addition, 
in contrast to these two studies, recent research indicates that even 
when parents’ involvement is accompanied by insensitive parenting
(e.g., intrusiveness), it may benefit children in terms of their
engagement and achievement in school, albeit not necessarily 
feelings of confidence and emotional functioning (Cheung & Po- 
merantz, 2011). Parents may also not need to be sensitive to 
become involved in children’s education. In some cases, parents’ 
involvement may be elicited by teachers (e.g., Green et al., 2007), 
rather than stemming from parents’ sensitive concern for children; 
involvement may also not require substantial parent-child interaction.
For example, teachers may bring children’s academic problems
to parents’ attention and support parents in addressing such
problems by referring them to resources such as tutoring services, 
or keeping parents up to date on children’s progress. Parents’ 
involvement on the school front may be particularly likely to 
confer benefits in the context of insensitive parenting because it 
requires relatively little interaction with children while still con- 
veying the value of school and directing teachers’ attention to 
children. 


Limitations and Future Directions 


There are several limitations of the current research that suggest 
caution in interpreting the findings. Following much prior research 
(e.g., Englund, Luckner, Whaley, & Egeland, 2004; Grolnick & 
Slowiaczek, 1994; Izzo, Weissberg, Kasprow, & Fendrich, 1999), 
teachers’ reports were used to assess parents’ involvement in 
children’s education. However, parents’ involvement takes place 
at home as well as school. Although teachers may be able to 
accurately report on parents’ involvement on the school front, they 
may not be able to do so when it comes to the home front, 
particularly for parents with whom they have little contact or
into whose cultural practices they have little insight. Half of the items in the
measure used in the current research ask about parents’ involvement
explicitly on the school front; the other half ask about it
more generally. Future research should incorporate teachers’, parents’,
and children’s reports, with explicit assessment of parents’
involvement on both the school and home fronts. We have sug- 
gested that parents’ involvement on the school front may be 
particularly likely to compensate for early insensitivity because, 
unlike parents’ involvement on the home front, it does not require 
substantial direct interaction with children. However, it is possible 
that parents’ involvement on the home front also plays a compen- 
satory role because it conveys the importance of school while also 
providing useful instruction and practices. 

The assessment of parents’ involvement in children’s education 
used in this research did not distinguish between mothers’ and 
fathers’ involvement. Thus, it is unclear whether mothers, fathers,
or both are pivotal in offsetting the effects of mothers’ early
insensitivity on children’s academic functioning. It could be that 
mothers who previously displayed insensitive parenting became 
involved in children’s education during elementary school, thereby 
compensating for the effects of their own earlier insensitivity. It is 
also plausible that it was fathers who compensated for the effects; 
it may be that when children are struggling as a result of mothers’ 
early insensitivity, fathers step in to provide support to children, 
either on their own or in conjunction with mothers. In line with this 
possibility, McBride, Dyer, Liu, Brown, and Hong (2009) sug- 
gested that fathers tend to become involved when children are 
having difficulties in school. Another limitation is that we exam- 
ined parents’ involvement over the whole of elementary school 
rather than at each year of this phase of development. This re- 
flected our assumption that the effects of early insensitivity cannot 
be offset with a brief interlude of involvement, needing instead 
sustained involvement. However, future research using time- 
varying analyses could reveal whether short periods of involve- 
ment also play a compensatory role. A profile approach could also 
be taken to identify distinct profiles that take into account different 
levels of early insensitivity and involvement as well as their 
consistency over time. 

A key question for future research is to what extent the com- 
pensatory role of parents’ involvement in children’s learning doc- 
umented here extends to children’s functioning beyond the aca- 
demic arena. Exposure to early insensitivity has been implicated in 
the development of social and emotional problems among children 
(e.g., Fraley et al., 2013; Haltigan, Roisman, & Fraley, 2013; 
NICHD ECCRN, 2004; Raby et al., 2013; Stams et al., 2002). 
Given that parents’ involvement in children’s education also plays 
a role in social and emotional functioning (e.g., Cheung & Pomer- 
antz, 2011; Hill et al., 2004), it may mitigate the effects of parents’ 
early insensitivity on these aspects of functioning. However, par- 
ents’ involvement in children’s education may need to be accom- 
panied by sensitive parenting, such as autonomy support, to ame- 
liorate social and emotional problems (Cheung & Pomerantz, 
2011). In addition, other parenting practices that directly target the 
psychological resources that facilitate social and emotional func- 
tioning may be more effective than parents’ involvement in chil- 
dren’s education—for example, parents’ involvement in children’s 
social lives as manifest in assisting children with developing 
strategies for resolving peer conflict. 


Conclusions 


Decades of research on the role of the early environment in 
children’s academic functioning have led investigators to argue
for its importance (e.g., Fraley et al., 2013; Heckman, 2006; Raby 
et al., 2013). The findings of the current research are consistent 
with this perspective in that early insensitive parenting predicted 
children’s academic functioning at the end of elementary school. 
However, these effects were sizably reduced when parents were 
involved in children’s education during the elementary school 
years. The results of the current research suggest that the detri- 
mental effects of early insensitive parenting for academic func- 
tioning may not be unalterable; parents’ involvement in children’s 
education appears to be a potential avenue for helping children 
overcome academic problems created by their early environment. 
Thus, future research would benefit from identifying what enables 
parents with a history of insensitive parenting to become involved
in children’s education. 
Strengthening Bullying Prevention Through School Staff Connectedness


Lindsey M. O’Brennan, Tracy E. Waasdorp, and Catherine P. Bradshaw 
Johns Hopkins University 


The growing concern about bullying and school violence has focused national attention on various 
aspects of school climate and school connectedness. The current study examined dimensions of staff 
connectedness (i.e., personal, student, staff, and administration) in relation to staff members’ comfort 
intervening in bullying situations (e.g., physical, verbal, relational), as well as bullying situations 
involving special populations of students (e.g., gender-nonconforming, disability, overweight, sexism, 
racism, and religion). Data for this study were collected from a national sample of 5,064 members of the 
National Education Association (NEA), of whom 2,163 were teachers and 2,901 other school staff. 
Analyses with structural equation modeling indicated that increased staff connectedness was associated 
with greater comfort intervening with bullying. Similarly, having resources available regarding bullying, 
receiving training on the school’s bullying policy, and being involved in bullying prevention efforts were 
significantly associated with comfort intervening. Implications for school-based prevention and school 
climate promoting efforts are discussed. 
prevention


Many prevention and intervention programs have the dual focus 
of reducing children’s aggressive and violent behaviors while 
promoting caring and supportive environments for students and 
staff (Gilman, Huebner, & Furlong, 2009; Thapa, Cohen, Guffey, 
& Higgins-D’Alessandro, 2013). Relationships among individuals
in the school environment are key contributors to student and staff 
perceptions of their school, often referred to as school connected- 
ness (Libbey, 2004). Several youth violence prevention programs 
emphasize connectedness among students, teachers, administra- 
tors, and educational support professionals as a way of increasing 
staff buy-in and program effectiveness (Beets et al., 2008; Brad- 
shaw, Koth, Thornton, & Leaf, 2009; Greenberg et al., 2003). A 
number of studies have highlighted the importance of teachers’ 
and staff members’ perceptions of the school (e.g., schools’ orga- 
nizational health) for high work productivity, staff efficacy, and 
focus on student success (Bevans, Bradshaw, Miech, & Leaf, 
2007; Hoy & Woolfolk, 1993; Pas, Bradshaw, Hershfeldt, & Leaf, 
2010). In particular, bullying prevention programs have increas- 
ingly focused on building positive relationships within the school 
community as a means of reducing peer victimization and shifting
schoolwide norms related to violence (Doll, Song, & Siemers, 
2004; Olweus, Limber, & Mihalic, 1999). 

The current study builds upon the school climate literature by 
examining multiple dimensions of school staff connectedness (i.e., 
staff-student connectedness, personal connectedness, staff-staff
connectedness, and staff-administration connectedness) in conjunction
with staff members’ perceptions of bullying prevention
programming efforts, as they relate to their willingness to inter- 
vene in bullying situations. This issue is particularly relevant for 
schools across the United States, given the nearly ubiquitous 
nature of school bullying and the associated academic, behavioral, 
and social-emotional risks for both perpetrators and targets 
(Swearer, Espelage, Vaillancourt, & Hymel, 2010). Therefore, it is 
critical that educators better understand factors that contribute to 
staff members’ willingness to intervene in different types of bul- 
lying situations. 


Bullying Among School-Aged Youth 


Bullying is broadly defined as intentional and repeated acts that 
occur through direct verbal (e.g., threatening, name calling), direct 
physical (e.g., hitting, kicking), and indirect (e.g., spreading ru- 
mors, influencing relationships, cyberbullying) forms, and it typ- 
ically occurs in situations in which there is a power or status 
difference (Olweus, 1993). In a recent national survey, 75% of 
teachers had a student report a verbal bullying incident to them, 
58% heard reports of relational bullying, 50% physical bullying, 
and 14% cyberbullying (Bradshaw, Waasdorp, & O’Brennan, 
2010; Bradshaw, Waasdorp, O’Brennan, & Gulemetova, 2013). 
Although bullying affects roughly 30% of school-age youth (Bradshaw,
Sawyer, & O’Brennan, 2007; Nansel et al., 2001), there are
populations of students who are at an increased risk for peer 
victimization. For instance, research indicates that youth identify- 
ing as lesbian, gay, bisexual, or transgender (LGBT) are more 
likely to be targets of bullying than their heterosexual peers (Kosciw,
Greytak, Diaz, & Bartkiewicz, 2010), which in turn increases
their risk for poor psychological functioning in young adulthood (Russell,
Ryan, Toomey, Diaz, & Sanchez, 2011). Students and staff
members also report bullying related to weight to be a significant 
problem across grade levels (Bradshaw et al., 2013), with over- 
weight or obese students being more likely to experience bullying 
than their peers (Brixval, Rayce, Rasmussen, Holstein, & Due, 
2012). Likewise, students with disabilities, especially those lack- 
ing age-appropriate social skills and displaying behavior problems, 
are at an increased risk of being targets of peer victimization 
(Rose, Monda-Amaya, & Espelage, 2011; Zablotsky, Bradshaw, 
Anderson, & Law, 2012). Last, racial and ethnic minorities may 
also be more likely to report experiencing bullying at school 
(Sawyer, Bradshaw, & O’Brennan, 2008). Despite research con- 
sistently showing that special populations are frequent targets of 
bullying, little is known about the relation between school staff 
members’ comfort intervening in bullying situations and their 
sense of connectedness to the school. 


School Staff Connectedness and Bullying Prevention 


Accumulating evidence suggests school climate and school con- 
nectedness are multidimensional constructs that include school 
safety, quality of relationships, discipline practices, and aspects of 
the physical environment (Thapa et al., 2013; You, O’Malley, & 
Furlong, 2013; Zullig, Koopman, Patton, & Ubbes, 2010). Con- 
sistent with ecological systems theory (Bronfenbrenner, 1979; 
Bronfenbrenner & Morris, 1998), these evolving relationships play 
a role in staff members’ sense of connectedness with others at 
school. In turn, this sense of affiliation and commitment with the 
school organization is expected to positively impact one’s likeli- 
hood of intervening in bullying situations. The sections that follow 
summarize literature on four interrelated dimensions of staff con- 
nectedness: (a) personal sense of safety and connectedness to 
school, (b) student-staff relationships, (c) staff relationships to 
fellow employees, and (d) staff connectedness to administrators as 
they relate to bullying interventions across populations of students. 


Personal Connectedness 


Personal connectedness is often thought of as a composite of 
feelings of respect and support from others at the school, percep- 
tions of safety, and overall job satisfaction (Butler, 2012; Parker, 
Martin, Colmar, & Liem, 2012; Skaalvik & Skaalvik, 2011). An 
often-overlooked aspect of staff’s personal connectedness is their
perceived level of safety of the school environment. Although 
students’ reports of safety are typically the catalyst for schoolwide 
violence prevention programs, national data revealed 7% of edu- 
cators report being threatened and 4% of school staff report being 
physically attacked by a student (Keigher, 2009). Examining this 
in the context of schoolwide bullying efforts, it seems plausible 
that teachers and educational support professionals would be less 
likely to intervene in bullying situations when they perceive aggressive
behavior to be the norm for the school (Kochenderfer-Ladd
& Pelletier, 2008). Conversely, when staff discern there to
be a positive and prosocial climate at the school, they may feel 
more comfortable addressing issues of peer victimization, partic- 
ularly in situations involving sensitive issues like ethnicity, obe- 
sity, and gender nonconformity. 


Student-Staff Connectedness 


A long line of research has documented the importance of 
student-teacher relationships across grade levels. Student-teacher
connectedness has been found to serve as a protective factor from 
the deleterious effects of bullying on students’ academic achieve- 
ment (Konishi, Hymel, Zumbo, & Li, 2010). Likewise, youth who 
report low school connectedness tend to report more instances of 
physical, verbal, and relational forms of peer victimization 
(O’Brennan & Furlong, 2010). From the school staff perspective, 
teachers’ relationships with students are a strong predictor of their 
professional commitment to teaching and loyalty to their specific 
school (Collie, Shapka, & Perry, 2011). Teachers who are close 
with their students are more likely to report greater job satisfaction 
and teacher efficacy and reduced student problem behavior in the 
classroom (Collie, Shapka, & Perry, 2012). With regard to bully- 
ing prevention efforts, multilevel studies suggest schools with 
more positive student-teacher relationships tend to have reduced 
rates of bullying episodes (Richard, Schneider, & Mallet, 2012). 
Students may also be more likely to report bullying incidents to 
staff members with whom they have existing relationships, thus 
increasing the probability school staff will intervene. Furthermore, 
staff who feel personally connected to their students may be more 
likely to broach sensitive topics and more directly address topics 
that have been historically taboo in schools, such as ethnic differ- 
ences and sexual orientation. Similarly, staff who feel personally 
connected to their students may also be inclined to put issues of 
difference aside and support students who are different than 
themselves. 


Staff-Administration Connectedness


School staff members’ relationships with administrators also 
have been shown to be important, especially as they relate to 
implementing schoolwide programs and new initiatives. For ex- 
ample, program implementation research shows that it takes 
schools roughly 3-5 years to implement schoolwide programs with
fidelity (Bradshaw, Reinke, Brown, Bevans, & Leaf, 2008); thus, 
it is essential for administrators to foster staff buy-in for program 
success. Strong working relationships among staff and administra- 
tion are often forged through shared leadership on schoolwide 
policies and interventions. For instance, Sun, Shek, and Siu (2008) 
found that a key component to successful program implementation 
was teachers’ ability to forge caring, respectful, and supportive 
relationships with the school administration. By forming positive 
staff-administrator relationships, schools are modeling positive 
interpersonal behaviors for students. Consequently, it was pre- 
dicted that school staff who report positive relationships with their 
administrators would be more invested in school violence pro- 
gramming and in turn feel more comfortable addressing bullying. 
We further anticipated that administrator support would be espe- 
cially important for addressing bullying situations among special 
populations due to the sensitive nature of these incidents. 


Staff-Staff Connectedness


Similar to staff-administrator relationships, school staff members’
relationships with each other are salient aspects of school
connectedness and program implementation. Teachers who openly 
communicate with their peers also tend to be more open to professional
growth and innovation (Collie et al., 2011). These findings
have been replicated among schools most at risk for teacher
attrition (Brown & Medway, 2007). In terms of bullying interven- 
tion, Kallestad and Olweus (2003) found that staff members’ 
openness and communication with one another significantly im- 
pacted the implementation of an antibullying program. This asso- 
ciation has been found to endure over time, with research showing 
that when teachers felt supported by their peers and administrators, 
they perceived the school climate more positively and delivered 
more lessons in a prevention curriculum (Gregory, Henry, 
Schoeny, & Metropolitan Area Study Research Group, 2007). 
Thus, it would appear that staff who feel connected to one another 
would be more invested and comfortable intervening in bullying at 
their school. 


Overview of the Current Study 


In an effort to address gaps in bullying prevention literature, the 
current study explored four key aspects of staff connectedness: (a) 
personal connectedness to school, (b) student–staff relationships,
(c) staff connectedness to administrators, and (d) staff relationships 
to fellow employees as they relate to comfort intervening with 
bullying in general, and more specifically in situations involving 
special populations of students. Structural equation modeling 
(SEM) was used to examine the association between school staff 
perceptions of connectedness and their comfort intervening with 
general bullying situations (physical, relational, verbal, and cyber) 
and when the bullying was specifically related to special popula- 
tions (i.e., LGBT youth, students with disabilities, racial/ethnic 
minority students, youth who are overweight). An SEM approach 
was selected because it allowed us to test our hypotheses using a 
latent variable framework (Bollen, 1989; Kline, 2005). Based on 
the available research, we predicted that school staff who report 
higher levels of connectedness would be more likely to intervene 
in bullying compared with school staff reporting low levels of 
connectedness. Specifically, we hypothesized that particular forms 
of connectedness, including personal sense of safety and connect- 
edness to school, student-staff relationships, and staff connected- 
ness to administrators and their peers, would be associated with a 
greater likelihood of intervening in bullying among special popu- 
lations. Finally, we examined whether the existence of formal 
bullying prevention programs and policies, involvement and train- 
ing related to programming efforts, and staff perceptions of avail- 
able bullying resources are associated with comfort intervening. 
We hypothesized that staff who reported clear policies and prevention
efforts, had access to available resources, and were involved in
the training efforts would be more comfortable intervening.


Method 


Sample 


Data for the current study come from the National Education 
Association (NEA), the country’s largest teachers’ union, which 
includes 3.2 million members nationwide. The authors partnered 
with the NEA to conduct a large-scale national study examining 
staff members’ perceptions of bullying and the school environment.
The sample included 5,064 adults who were members of the
NEA at the time of the data collection and were actively employed
by a school or school system. A little over half of the sample was 
education support professionals (ESPs; n = 2,901) and the remain- 
ing participants were teachers (n = 2,163). As later described in 
greater detail, the sample was weighted to be representative of the 
full population of NEA members. Nearly half of the ESPs were 
paraprofessionals (49%), followed by maintenance (14%), clerical 
(10%), school transportation (10%), food service (7%), health and 
student services (2%), technical and skilled trades (2%), security 
(1%), and other nonteaching support staff (6%). Women composed 
80% of the sample, and 89% of the sample self-identified as 
White, with 5% Black, 4% Hispanic, and 2% other. The partici- 
pants were employed in a variety of school locations (34% sub- 
urban, 24% small town, 24% urban, and 18% rural areas). Ap- 
proximately 39% worked with students in elementary, 19% in 
middle, and 27% in high schools, with the remaining 16% working 
across multiple grade levels (see Watts, 2010). 


Procedure 


The data were collected in the spring of 2010. In an effort to 
survey a representative sample of NEA members, both a telephone 
(63%) and a Web survey (37%) were used. Specifically, we used 
the Web because of growing concerns that individuals are less 
inclined to participate in and/or be reached by phone surveys 
(Holbrook, Krosnick, & Pfent, 2007). In total, 1,601 teachers and 
2,142 ESPs completed the telephone survey, whereas 562 teachers 
and 759 ESPs completed the Web survey. The data collection 
activities were conducted by an external professional research firm 
contracted by the NEA; the subcontractor made the phone calls 
and administered the survey on behalf of the NEA. With regard to 
incentives for participation, a lottery was used, whereby all par- 
ticipants were informed that 20 participants would be selected at 
random for $100. Participants were told that the purpose of the 
study was to inform the NEA about members’ concerns and needs 
related to bullying and school climate. A sampling procedure was 
used to select participants, which accounted for role and select 
demographics (e.g., age, region, race), thereby allowing the data to 
be weighted up to reflect the entire population of NEA members; 
weighting is possible because of the known population distribu- 
tions in the overall NEA membership database. As later described 
in greater detail, two weighting procedures were utilized on the 
data: a propensity score was used to adjust for the mode of survey 
administration (i.e., Web vs. phone) and a rim weight to weight the 
entire data set to the national population of NEA members (Watts, 
2010). The overall participation rate was 31% (35% phone, 24% 
Web). 
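
As a quick arithmetic check on the counts reported in this section, the role-by-mode cell counts can be summed to confirm that they reproduce the stated totals (2,163 teachers, 2,901 ESPs, 5,064 respondents). The sketch below is illustrative only: the dictionary layout and the helper name `total_by` are assumptions introduced here, not part of the study. The share it computes is unweighted, so it need not match the reported 63% and 37%, which may rest on a different (e.g., weighted) base.

```python
# Respondent counts by role and survey mode, as reported in the Procedure section.
counts = {
    ("teacher", "phone"): 1601,
    ("esp", "phone"): 2142,
    ("teacher", "web"): 562,
    ("esp", "web"): 759,
}


def total_by(index):
    """Sum counts over one key position (0 = role, 1 = survey mode)."""
    totals = {}
    for key, n in counts.items():
        totals[key[index]] = totals.get(key[index], 0) + n
    return totals


by_role = total_by(0)   # teachers vs. ESPs
by_mode = total_by(1)   # phone vs. Web
n_total = sum(counts.values())

# Unweighted share of phone completions (illustrative; not a figure from the article).
unweighted_phone_share = by_mode["phone"] / n_total
```

The role totals and the grand total agree with the figures given in the Sample section.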


Measure 


The NEA Bullying Survey (see Bradshaw et al., 2010, for the 
complete survey; also see Bradshaw et al., 2013) was developed by 
the research team in close collaboration with the NEA Research 
Department. Consistent with previous studies of bullying (Brad- 
shaw et al., 2007; Nansel et al., 2001; Olweus, 1993), bullying was 
defined on the survey as “intentional and repeated aggressive acts 
that can be physical (such as hitting); verbal (such as threats or 
name calling), or relational (such as spreading rumors or influenc- 
ing social relationships). Bullying typically occurs in situations 
where there is a power or status difference.” 
Connectedness. Staff connectedness was assessed through
items from the Charles F. Kettering Climate Scale (Johnson,
Johnson, Kranch, & Zimmerman, 1999) and the Collegial Leadership
subscale of the Organizational Health Inventory (Hoy &
Woolfolk, 1993). In total, there were 21 items (full scale, α = .95)
broken into four subscales: personal connectedness (eight items,
α = .89; “My ideas are listened to; people care about me at this
school,” “I like to work in my school”), student–staff connectedness
(four items, α = .90; “Staff really care about the students,”
“Staff are on students’ side”), staff–staff connectedness (five
items, α = .91; “Staff are friendly to each other,” “Staff have trust
and confidence in each other”), and staff–administration connectedness
(four items, α = .90; “Principal shows staff appreciation,”
“Principal looks out for staff”). Response options were on a
4-point Likert scale from disagree strongly (1) to agree strongly
(4), with higher scores indicating higher levels of connectedness.
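
The reliability coefficients reported for these subscales are Cronbach's alpha values. As a minimal sketch of how such a coefficient is obtained, alpha for k items equals k/(k − 1) × (1 − sum of item variances / variance of the total score). The responses below are hypothetical 4-point ratings invented for illustration; they are not data from the NEA survey, and the function name is an assumption.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the summed scale score.
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of six staff members to a four-item subscale (1-4 scale).
toy_responses = [
    [4, 4, 3, 4],
    [3, 3, 3, 2],
    [2, 2, 1, 2],
    [4, 3, 4, 4],
    [1, 2, 1, 1],
    [3, 4, 3, 3],
]

alpha = cronbach_alpha(toy_responses)
```

Items that rise and fall together across respondents push the total-score variance well above the sum of the item variances, which is what drives alpha toward 1.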

Comfort intervening with bullying. Staff were asked how
comfortable they would feel intervening with a student who engaged
in bullying (4-point scale, ranging from very uncomfortable
to very comfortable) across four forms of bullying (i.e., physical,
relational, verbal, and cyber). The four forms of bullying were
correlated (range .50–.74) and therefore were modeled as a single
latent variable, “comfort intervening with general bullying” (α =
.87).

Comfort intervening with special populations. Staff members
were asked how comfortable they would feel intervening with
a student who engaged in bullying (4-point scale, ranging from
very uncomfortable to very comfortable) across six situations of
bullying (i.e., bullying related to sexual orientation or gender
nonconformity, disability, being overweight, sexism, racism, and
religion). The six types of bullying were highly correlated (range
.65–.84) and therefore were modeled as a single latent variable,
“comfort intervening with special populations” (α = .95).

Perceptions of bullying policies and programming. 
Perceptions of the school’s bullying policies and programming 
were assessed through three yes/no questions: (a) Does your 
school district have a bullying policy? (b) Is the policy clear and 
easy to implement? (c) Did you receive training on how to imple- 
ment the policy? Perceptions of bullying prevention efforts were 
assessed through two yes/no items: (a) Does the school you work 
in most frequently have formal prevention efforts, such as
school teams, a committee, or prevention program that deals with 
bullying? (b) Are you currently involved in bullying prevention 
activities at the school you work in most frequently? To assess 
resource availability, staff rated one item (“There are resources 
available to me to help me intervene with bullying”) on a 4-point 
scale from disagree strongly to agree strongly. These items were 
adapted from the measure by Bradshaw et al. (2007) (also see 
Bradshaw et al., 2013). All items had “Not sure” as an option, 
which was coded as missing in these analyses. The survey
employed a skip pattern, whereby if a staff member reported “no” to
a particular question, he or she would not be asked follow-up questions
regarding that particular issue. For example, in the situation
where a participant indicated that his or her school did not have a 
bullying prevention policy, that person was not asked follow-up 
questions regarding how easy it was to implement the policy. As 
a result, those individuals not asked a particular question were 
excluded from analyses of that question. 
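
The coding rules described above (“Not sure” treated as missing, follow-up items skipped after a “no”) can be sketched as follows. This is a hedged illustration rather than the authors' actual data-processing code: the function names, the 1/0 coding, and the example answers are all assumptions made here.

```python
def recode_yes_no(answer):
    """Map a raw yes/no answer to 1/0; 'Not sure' or no answer becomes None (missing)."""
    if answer is None:
        return None
    return {"yes": 1, "no": 0}.get(answer.strip().lower())  # "not sure" -> None


def apply_skip_pattern(has_policy, follow_up):
    """Keep a follow-up answer only for staff who reported that a policy exists."""
    return follow_up if has_policy == 1 else None


def proportion_yes(values):
    """Share of 'yes' among respondents with a usable (non-missing) answer."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None


# Hypothetical raw answers: (district has a policy?, policy clear and easy to implement?)
raw = [("Yes", "Yes"), ("No", None), ("Not sure", None), ("Yes", "No"), ("Yes", "Not sure")]

has_policy = [recode_yes_no(first) for first, _ in raw]
policy_clear = [
    apply_skip_pattern(flag, recode_yes_no(second))
    for flag, (_, second) in zip(has_policy, raw)
]
```

Under these rules, each item is summarized only over the respondents who were asked it and gave a definite answer, so the denominators differ from item to item.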


Overview of Analyses 


A series of SEMs was fit to test our primary research
questions related to the association between various aspects of 
school climate and staff members’ willingness to intervene in 
different bullying situations (i.e., general forms of bullying and 
with special populations). Based on prior research with this sample 
(Bradshaw et al., 2013), we adjusted for a set of covariates includ- 
ing amount of interaction between students and staff, school level 
(elementary and high school, with middle school as the reference 
category), school location (urban vs. suburban/rural), survey mo- 
dality (Web vs. phone), and role in school (ESP vs. teacher). Staff 
age and number of years working in education were also included 
as continuous variables. We also applied sampling weights in all 
analyses (see later description). Missing data were generally not a 
concern, as 93% of the sample had no missing data, and each item 
had less than 2% missing. Analyses were conducted in Mplus Version 7.1
(Muthén & Muthén, 1998–2012); missing data were assumed to be missing at
random, and all analyses adjusted for missing data using full information
maximum likelihood.
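
The categorical covariates listed above enter a model as indicator (dummy) variables, with middle school as the omitted reference category for school level. The sketch below shows one way to build such indicators. The variable names and the choice of which category is coded 1 for the binary covariates are assumptions for illustration; the article does not report the exact coding used in Mplus.

```python
SCHOOL_LEVELS = ("elementary", "middle", "high")


def code_covariates(level, urban, web, esp):
    """Return 0/1 indicators for one staff member.

    Middle school is the reference category, so it gets no indicator of its own:
    a middle school staff member has elementary = 0 and high = 0.
    """
    if level not in SCHOOL_LEVELS:
        raise ValueError(f"unknown school level: {level!r}")
    return {
        "elementary": int(level == "elementary"),
        "high": int(level == "high"),
        "urban": int(bool(urban)),  # 1 = urban, 0 = suburban/rural (assumed coding)
        "web": int(bool(web)),      # 1 = Web survey, 0 = phone (assumed coding)
        "esp": int(bool(esp)),      # 1 = ESP, 0 = teacher (assumed coding)
    }


middle_school_teacher = code_covariates("middle", urban=True, web=False, esp=False)
high_school_esp = code_covariates("high", urban=False, web=True, esp=True)
```

With this coding, the coefficients for the elementary and high school indicators are each read as a difference from middle school staff.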

Sample weighting. Two types of weights were applied to the 
data. First, we applied a propensity score weight to adjust for the 
mode of survey administration (i.e., Web vs. phone; Rosenbaum & 
Rubin, 1983; Schonlau, van Soest, Kapteyn, & Couper, 2009). The 
purpose of the propensity score weights was to make the Web- 
based survey comparable to the phone-based survey. Each partic- 
ipant was assigned a weight based on his or her propensity score, 
which was constructed based on 16 different demographic vari- 
ables (e.g., full- vs. part-time worker, region of the country, has 
phone/landline, suburban location, years worked in the school, 
interaction level with students). These methods are commonly 
used in large-scale surveys that employ both phone and Web-based 
assessments (for additional details, see Schonlau et al., 2009; 
Taylor, 2000). Our decision to apply this type of weight was based 
on preliminary analyses of the data, which suggested that there 
were some systematic differences in the responses to select survey 
items based on the mode of survey administration. For example, 
phone respondents had a tendency to report greater comfort inter- 
vening in the different types of bullying situations assessed; this is 
likely due to a social desirability bias among phone participants 
(Kreuter, Presser, & Tourangeau, 2008; Watts, 2010). As a result, 
the propensity score weights, along with controlling for survey 
administration as a covariate in the analyses, allowed us to account 
for potential bias associated with those respondents who com- 
pleted the Web survey compared with those who completed the 
phone survey. The second weight applied was a rim weight, which 
is a common weighting approach that enabled us to weight the 
entire data set to the national population of NEA members (Watts, 
2010). Specifically, rim weighting was utilized to weight the 
sample that participated in the survey to those in the known NEA 
population. Therefore, the weighted sample reflects the full NEA 
membership. 
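
Rim weighting (also called raking or iterative proportional fitting) adjusts respondent weights one margin at a time until the weighted sample matches known population shares on every weighting variable. The following is a minimal sketch under stated assumptions: the two weighting variables, the category labels, and the target shares are hypothetical, and the procedure actually used for the NEA data (Watts, 2010) is not described in enough detail here to reproduce.

```python
def rake(records, targets, max_iter=200, tol=1e-12):
    """Return one weight per record so that weighted shares match `targets`.

    records: list of dicts holding categorical fields.
    targets: {variable: {category: population share}}, shares summing to 1.
    """
    weights = [1.0] * len(records)
    for _ in range(max_iter):
        max_change = 0.0
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted total in each category of this variable.
            current = {}
            for w, rec in zip(weights, records):
                current[rec[var]] = current.get(rec[var], 0.0) + w
            # Scale every record so this margin matches its target share.
            for i, rec in enumerate(records):
                factor = shares[rec[var]] * total / current[rec[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return weights


def weighted_share(records, weights, var, category):
    """Weighted proportion of records falling in `category` of `var`."""
    hit = sum(w for w, rec in zip(weights, records) if rec[var] == category)
    return hit / sum(weights)


# Hypothetical sample of ten respondents and hypothetical population margins.
records = (
    [{"role": "teacher", "locale": "urban"}] * 2
    + [{"role": "teacher", "locale": "other"}] * 3
    + [{"role": "esp", "locale": "urban"}] * 4
    + [{"role": "esp", "locale": "other"}] * 1
)
targets = {
    "role": {"teacher": 0.6, "esp": 0.4},
    "locale": {"urban": 0.3, "other": 0.7},
}
weights = rake(records, targets)
```

Because each pass fixes one margin and slightly disturbs the others, the loop repeats until the adjustment factors are all close to 1. Only the margins are matched; the joint distribution of the weighting variables is not.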


Results 


Sample Demographics 


On average, staff reported working at the school for 10 years 
(SD = 8.85, range: 0-77 years) and ranged in age from 19 to 80 
years old (M = 46.21, SD = 14.55). The majority of staff (74.4%)
reported interacting with students constantly, 18.6% reported interacting
with students “a great deal,” and 7% reported very little
interaction with students (i.e., “only a little” or “almost none”).
Approximately 34% reported the school they work in was located 
in a suburban community, 23.9% an urban community, 24.1% a 
small town, and 17.6% in a rural community. Roughly 40% of staff 
worked in elementary schools, 33.4% in middle schools, and 
27.3% in high schools (for additional descriptives, see Bradshaw et 
al., 2013). 


Fit of the Measurement Model 


Confirmatory factor analysis was used to assess the fit of the
four latent connectedness variables (i.e., personal connectedness,
staff–staff connectedness, principal connectedness, and staff–student
connectedness; see Table 1). This model demonstrated
acceptable fit (comparative fit index [CFI] = .95, Tucker–Lewis
index [TLI] = .94, root-mean-square error of approximation [RMSEA]
= .03, standardized root-mean-square residual [SRMR] =
.04). Similarly, the latent variable “comfort intervening with
general bullying,” composed of the four different forms of bullying
(i.e., physical, relational, verbal, and cyber), had adequate fit,
CFI = .99, TLI = .96, RMSEA = .06, SRMR = .02 (see Table 1).
Finally, the latent variable “comfort intervening with special
populations,” composed of the six different types of bullying situations
(i.e., bullying related to sexual orientation/gender nonconformity,
disability, being overweight, sexism, racism, and religion), had
adequate fit, CFI = .97, TLI = .95, RMSEA = .06, SRMR = .03
(see Table 1).


Table 1
Standardized Parameter Estimates for Staff Connectedness and
Comfort Intervening Latent Variables

Latent variable                                       Estimate
Staff connectedness
  Personal connectedness
    Like to work at school                            oe)
    Ideas listened to                                 wD:
    Someone to count on                               .53
    People care about me                              le
    Feel wanted and needed                            .84
    Feel safe                                         SS)
    Recognition for good job                          .74
    Inspired to do my best                            .81
  Staff connectedness
    Staff like each other                             .85
    Staff are friendly with each other                .82
    Staff trust and have confidence in each other     .81
    Staff help each other                             .19
    Staff respect each other                          .86
  Principal connectedness
    Principal shows appreciation                      .88
    Principal conveys what’s expected of staff        ae
    Principal looks out for staff                     Dil
    Principal is friendly and approachable            .82
  Student connectedness
    Students feel staff are “on their side”           .62
    Staff feel pride in the school and its students   .81
    Staff really care about students                  .68
    High expectations for students to achieve         .67
Comfort intervening with special populations
    Sexual orientation or gender nonconformity        9
    Disability                                        .87
    Overweight                                        .87
    Sexist comments                                   .87
    Racist comments                                   a2
    Negative comments about religion                  .88
Comfort intervening with general bullying
    Physical                                          .16
    Verbal                                            .92
    Relational                                        .88
    Cyber                                             .65
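
The fit indices reported in this section are all functions of the model chi-square, the baseline (independence) model chi-square, their degrees of freedom, and the sample size. The sketch below uses the common textbook definitions with made-up input values; none of the numbers come from this study. Software conventions differ (for example, N versus N − 1 in the RMSEA denominator, or scaled chi-squares for weighted data), so these functions are not guaranteed to reproduce values printed by Mplus.

```python
import math


def fit_indices(chi2_model, df_model, chi2_base, df_base, n):
    """Return (CFI, TLI, RMSEA) from model and baseline chi-square statistics."""
    # Noncentrality estimates, floored at zero.
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, 0.0)

    denom = max(d_model, d_base)
    cfi = 1.0 - d_model / denom if denom > 0 else 1.0

    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    tli = (ratio_base - ratio_model) / (ratio_base - 1.0)

    rmsea = math.sqrt(d_model / (df_model * (n - 1)))
    return cfi, tli, rmsea


# Hypothetical inputs chosen only to make the arithmetic easy to follow.
cfi, tli, rmsea = fit_indices(
    chi2_model=150.0, df_model=100, chi2_base=2100.0, df_base=120, n=501
)
```

CFI and TLI compare the fitted model with the baseline model, whereas RMSEA expresses misfit per degree of freedom. This is why a model can show a high CFI alongside a comparatively less favorable RMSEA, as with the two comfort-intervening measurement models.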


Fit of the Structural Model 


To assess our primary research aims, we fit a series of three 
SEM models for the two separate outcomes: (a) staff comfort 
intervening with general bullying and (b) staff comfort intervening 
with special populations. 

Staff connectedness. Model 1 examined the relationship between
the four connectedness latent variables (i.e., personal connectedness
and student–staff, staff–administration, and staff–staff
connectedness) and comfort intervening with bullying. For general
bullying situations, Model 1 demonstrated acceptable fit, CFI = 
.95, TLI = .95, RMSEA = .03, SRMR = .04 (see Table 2), and 
indicated there were no significant associations between connect- 
edness and comfort intervening with general bullying situations. 
With regard to the model covariates (see Table 3), staff in urban 
settings had lower levels of connectedness than rural and sub- 
urban staff. Teachers had lower levels of connectedness than 
ESPs. In general, elementary staff reported higher levels of con- 
nection compared with middle school staff, whereas high school 
staff reported lower levels of connection compared with those in 
middle school. 

When comfort intervening with special populations was the 
outcome, Model 1 also demonstrated acceptable fit, CFI = .95,
TLI = .95, RMSEA = .03, SRMR = .04 (see Table 2); however, 
results indicated that greater connectedness was associated with 
greater comfort intervening with bullying among special popula- 
tions. As illustrated in Table 2, personal connectedness to the 
school and staff—staff connectedness were significantly associated 
with comfort intervening (p < .05). Student–staff connectedness
was also positively associated with comfort intervening (p <
.05). Principal connectedness, however, was not significantly associated
with comfort intervening with special populations. The
associations between covariates and connectedness variables were
similar to those in the model for comfort intervening with general
bullying (see Table 3).

Bullying policies and programming. To examine our second 
aim, we added the following three school characteristics to the 
previous models: (a) presence of a district bullying prevention 
policy, (b) presence of formal bullying prevention programming at 
the school, and (c) available resources regarding bullying. For 
comfort intervening with general bullying, the three school char- 
acteristics were included (see Model 2 in Table 2) and demon- 
strated adequate fit (CFI = .93, TLI = .92, RMSEA = .03, 
SRMR = .08). Results indicated that the availability of resources 
was associated with feeling more comfortable intervening with 
general bullying. 

Table 2
Standardized Estimates for Latent Variables in Model and Model Fit Statistics for Special
Populations and General Bullying

                                          Model 1                   Model 2                   Model 3
                                 Special      General      Special      General      Special      General
Latent variables                 populations  bullying     populations  bullying     populations  bullying
Personal connectedness           0.05"        0.04         0.05"        0.05*        0.05*        0.07"
Staff connectedness              0.06*        0.04         0.06"        0.06"        0.06"        0.08"
Principal connectedness          −0.002       0.01         0.001        0.02         0.001        0.02
Student connectedness            0.05"        0.02         0.05"        0.03         0.05*        0.10°
District has a policy                                      0.02         0.03         –            –
Formal prevention efforts                                  −0.01        0.02         –            –
Resources available                                        Ona          0.06"**      0.06"        0.10"
Policy easy to implement                                                             0.03         0.04
Received training on policy                                                          0.05"        0.01
Involved in prevention efforts                                                       0.13"        Oger
Model fit statistics
  Comparative fit index                .95                 .94          .93          .94          .93
  Tucker–Lewis index                   .95                 .93          .92          .93          .92
  Root-mean-square error of
    approximation                      .026                .028         .031         .028         .027
  Standardized root-mean-square
    residual                           .035                .075         .078         .075         .073
*p < .05. **p < .01. ***p < .001.

The final model (Model 3) examined staff personal involvement
and perceptions of bullying resources and programming as they
relate to comfort intervening in general bullying situations. As
seen in Table 2, Model 3 demonstrated adequate fit (CFI = .93,
TLI = .92, RMSEA = .03, SRMR = .07) and indicated that
having resources available regarding bullying and being involved
in bullying prevention efforts were significantly associated with
comfort intervening. However, perceiving a school’s bullying policy
as being easy to implement and receiving training on the
school’s bullying policy were not significant (see Table 2).
Similar to intervening in general bullying situations, the results
of Model 2 indicated that availability of resources was also associated
with feeling more comfortable intervening in bullying
among special populations. Yet, neither the presence of a district
policy nor formal prevention programming in the school was
significantly associated with staff members’ comfort. As seen in Table 2,
this model demonstrated adequate fit (CFI = .94, TLI = .93,
RMSEA = .03, SRMR = .04). The final model (Model 3) examined
staff personal involvement and perceptions of bullying resources
and programming in combination with the retained variables
from Models 1 and 2. Model 3 demonstrated similar fit to the
prior model (CFI = .94, TLI = .93, RMSEA = .03, SRMR = .07).
Results indicated that having resources available regarding bullying,
receiving training on the school’s bullying policy, and being
involved in bullying prevention efforts were significantly associated
with comfort intervening; however, perceiving the policy as
being easy to implement was not (see Table 2).


Table 3
Standardized Estimates for Covariates Included in Model 1

                               Personal connectedness    Staff connectedness       Principal connectedness   Student connectedness
                               Special      General      Special      General      Special      General      Special      General
Covariates                     populations  bullying     populations  bullying     populations  bullying     populations  bullying
Amount of student interaction  0.02         0.02         0.02         0.02         −0.02        −0.01        0.00         0.00
Survey mode                    COsilesiee (Opes 0207" OxIGy 0.26" 0.14" ONS se OtSie
Urban                          Ora −0.14e ea lesae Oe oe OSL as −0.08"" ae El Sana c= li aes
Teachers (vs. ESPs)            ce OS as −0.06""* −0.05" (O03. ca Ome aa) Ome −0.05™ −0.04"
Elementary school              0.07""       0.07"        0.05         0.05         0.05         0.03         0.18"        02053
High school                    == (ale E- Oneaes =O ONO − Onl: −0.07" −0.09™ = Onis
Years worked                   0.001        = O005m      0.001        −01003%8     −0.002       −0.008""     0.002        −0.001**
Age                            −0.001       10) 0m       0.002        −0.001""     −0.001       −0.004""     0.002"       0.000
Note. Middle school served as the reference group for the elementary and high school covariates. ESPs = education support professionals.
*p < .05. **p < .01. ***p < .001.


Model covariates. Several covariates were included in the
general bullying and special populations models to control for staff
characteristics as they relate to the latent connectedness variables
(see Table 3). For both the general bullying and special populations
models, teachers (compared with ESPs), high school staff,
and staff working in urban neighborhoods tended to have lower
levels of connectedness (p < .05). Surprisingly, the amount of
interaction between staff and students was not significantly related
to their level of connectedness. In terms of personal characteristics,
staff who were older and had spent more years working in education
tended to have lower levels of connectedness, but this was
only the case for the general bullying model (p < .01). Finally, we
tested a set of exploratory post hoc interactions for urbanicity
and school level (in lieu of modeling them as covariates). We
constrained the association between the latent school connectedness
variables and the outcome to be equal across groups and
examined the difference in model fit via the Wald test (Muthén &
Muthén, 1998–2012); however, we found no evidence of a significant
interaction effect involving either of these variables (results
not reported). Therefore, the exploratory interaction effects
were dropped from the final models.
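
The logic of testing whether an association differs across groups can be illustrated with the simplest form of a Wald test for the equality of two coefficients. This is a hedged sketch, not the authors' Mplus procedure: it assumes the two estimates come from independent groups (so their covariance is zero), and the coefficients and standard errors below are invented for illustration.

```python
import math


def wald_equality_test(b1, se1, b2, se2):
    """Wald test of H0: b1 == b2 for two independently estimated coefficients.

    Returns the Wald statistic (chi-square with 1 df) and its p value.
    """
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    z = (b1 - b2) / se_diff
    wald = z ** 2
    # For 1 df, P(chi-square > z^2) equals the two-sided normal tail probability.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return wald, p_value


# Hypothetical path estimates for two groups (e.g., urban vs. nonurban schools).
wald, p_value = wald_equality_test(b1=0.10, se1=0.03, b2=0.04, se2=0.04)
```

A nonsignificant result, as in this toy example, would be read the same way as in the text: there is no evidence that the path differs between groups, so a single pooled estimate is retained.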


Discussion 


The current challenge in bullying prevention is to reduce
rates of violence at school while also bolstering school climate and
student and staff members’ connection to the school community
(Waasdorp, Bradshaw, & Leaf, 2012). Despite the surge in the 
literature on these two interrelated topics, relatively little empirical 
research has specifically examined the overlap between bullying 
prevention and school connectedness from the perspective of 
school staff. The purpose of the current study was to tease apart the 
multiple domains of staff connectedness and examine how they 
directly relate to the comfort of school staff in intervening in 
bullying situations. In addition, we examined how schoolwide 
policies and programming influence staff members’ likelihood of 
intervening. This line of research has considerable relevance for 
educational psychologists interested in improving conditions for 
learning and engaging school staff in prevention efforts. 


Staff Connectedness and Comfort Intervening 


We found that the relationship between staff connectedness and 
comfort intervening varied depending on whether it was a general 
bullying situation or one that involved special populations of 
students. For general bullying, personal, student, and peer con- 
nectedness were salient factors in staff members’ comfort inter- 
vening only when schoolwide policies and programming efforts 
were added to the model. On the other hand, higher levels of staff 
connectedness were consistently related to reports of being more 
comfortable intervening with special populations. Specifically, 
staff members’ close relationships with students and their colleagues,
as well as the school in general, had a positive impact on
their comfort intervening with at-risk groups of students. This
finding is consistent with the educational research suggesting that 
trust and support are the key elements to creating successful 
working relationships within the school (Skaalvik & Skaalvik, 
2011; Wahlstrom & Louis, 2008). These positive working rela- 
tionships likely help to create a collective sense of school pride 
that encourages school personnel to take a proactive stance on 
bullying prevention. 

However, not all aspects of connectedness were associated with 
staff members’ comfort intervening. Of the four domains of con- 
nectedness, staff relationships with administration were not pre- 
dictive of comfort intervening in general bullying and special 
populations. This is somewhat surprising since research shows that 
when teachers and staff feel supported by their administration, they 
tend to report higher levels of commitment and more collegiality,
the consequence of which is increased staff retention (Singh &
Billingsley, 1998). Perhaps, teachers’ relationship with the 
school’s administration team is more salient with regard to school- 
wide program buy-in but not necessarily for on-the-spot bullying 
intervention. Additional research is needed to examine how ad- 
ministrative support impacts staff perceptions of the school and 
involvement in bullying intervention efforts. In contrast, although 
few specific hypotheses were formulated related to the staff con- 
nectedness with colleagues, this form of connectedness was asso- 
ciated with intervening in general bullying situations and special 
populations. Taken together, these findings emphasize the impor- 
tance of multiple domains of connectedness in influencing staff 
members’ willingness to intervene in bullying situations. 

School factors and connectedness. The analyses also high- 
lighted some salient school-level factors that influence staff reports 
of connectedness. This is particularly important given prior re- 
search suggesting that characteristics of the school environment 
are linked to teacher willingness to implement programs in their 
classroom (Domitrovich et al., 2008; Han & Weiss, 2005). For
example, staff in urban settings had lower levels of connectedness 
than staff in rural and suburban settings. Schools in
urban inner-city neighborhoods are typically at a greater economic 
disadvantage, have less social cohesion, and have fewer resources 
for educating children than suburban schools (Elliott et al., 1996; 
Tolan, Gorman-Smith, & Henry, 2003). With the increased risk for 
schoolwide disorganization, urban schools also tend to have higher 
rates of disruptive student behavior. A national survey of teachers 
revealed that teachers working in urban communities report more 
problem behavior in their classrooms than teachers working in 
suburban and rural schools (Provasnik et al., 2007). Thus, it is 
important for these schools to tailor bullying interventions that 
both teach staff how to intervene effectively and build connected- 
ness and trust among the school community. 

Second, there were several significant differences by grade 
level, such that elementary staff reported the highest level of 
connectedness, followed by middle, and then high school staff 
members; however, the associations were not moderated by school 
level. Prior research on student connectedness found similar re- 
sults, with younger students having higher levels of school con- 
nectedness than their older peers (Furlong, Pavelski, & Saxton, 
2002; Whitlock, 2006). Elementary teachers tend to develop closer 
relationships with their students since they stay in one class the 
majority of the school day, whereas middle and high school-age 
youth frequently transition from class to class. Middle and high 
schools also tend to have higher enrollments and larger school 
campuses, which reduces the ability of staff members to form 
close, personal relationships with one another. One way that 
schools can foster connectedness among teaching staff is by in- 
volving staff across grade levels in schoolwide programming 
efforts, as opposed to specific grade-level meetings. 


Bullying Programming and Comfort Intervening 


A secondary aim of the study was to examine the link between 
staff perceptions of their school’s bullying prevention program- 
ming and their comfort intervening with bullying. It is interesting 
that the presence of a district bullying policy and its ease of 
implementation, as well as the availability of formal prevention 
programming at the school, were not associated with comfort
intervening in bullying. In other words, simply having policies and
programs available does not equate to teachers and staff feeling 
comfortable implementing programs with fidelity. Rather, the 
analyses show that staff who are actively engaged with schoolwide 
programming are more likely to help those youth involved in 
bullying. Specifically, when staff were involved in the bullying 
prevention efforts and had access to bullying resources at their 
school, they tended to feel more comfortable intervening in both 
general forms of bullying, as well as bullying related to sexual 
orientation or gender nonconformity, disability, being overweight, 
sexism, racism, and religion. Previous research indicates that bullying
prevention programs are not only more effective but are
more likely to be sustained over time if staff and administrators 
take part in developing the program (Hirschstein & Frey, 2006; 
Rigby, 2007). In a longitudinal study, Rhodes, Camic, Milburn, 
and Lowe (2009) found that when school staff were active collab- 
orators in identifying schoolwide programs (e.g., antibullying, 
after-school activities, and teacher wellness programs) and subse- 
quently assisted in the implementation of these interventions, there 
were increases in teacher attitudes and perceptions of school 
climate. More important, these changes in perceptions of climate 
were found to positively impact students’ perceptions of the school 
(e.g., positive peer interactions, teacher support, and academic 
engagement). Therefore, prior to implementing a prevention pro- 
gram, it would be important to assess the level of engagement and 
interest from teachers and school staff in order to ensure program 
effectiveness. 

Drawing upon the education literature, studies have shown 
teachers’ level of experience and prior training influence their 
perceptions of self-efficacy to teach students (e.g., Tschannen- 
Moran & Woolfolk Hoy, 2007). This line of research suggests that 
when teachers feel that they have the skills to intervene in bullying
or perceive that they can help change the school norms related to peer
victimization (Wood, 1992; Domitrovich et al., 2008), they are more
likely to take a proactive stance on reducing
bullying at school. A growing body of research has shown that
when teachers and administrators work collaboratively on school- 
based prevention programs, teachers tend to be more invested in 
the program’s short- and long-term outcomes (Adelman & Taylor, 
2003). Taken together, these findings suggest that both students’ 
and staff members’ perceptions of school climate may be predic- 
tive of bullying prevention program implementation and outcomes 
(Beets et al., 2008; Bradshaw et al., 2009; Bradshaw & Waasdorp, 
2009). 


Limitations and Strengths 


It is important to note some limitations when interpreting these 
findings. For example, the data are self-reported; therefore, we are 
unable to ensure the validity of these data. Future research should 
employ a mixed-methods approach, combining teacher report and 
observational data to capture how connectedness relates to staff 
responses to bullying incidents. Social desirability may play a role 
in participants’ responses, and this may vary by the mode of survey 
(i.e., Web vs. phone). Not all staff who were e-mailed or called agreed
to participate; therefore, it is possible that staff more involved 
in bullying prevention efforts or more concerned about the issue 
were more likely to agree to participate. The data are cross-
sectional, so we are unable to draw any conclusions regarding
causality. Additional research is needed to examine how other fac-
tors, such as teacher efficacy and burnout, relate to teachers’ willing-
ness to intervene.

Nevertheless, this study has several strengths, most notably the 
nationally representative design, the large sample size, the linkage 
with the NEA population, the use of propensity scores to address 
potential sampling biases, and the inclusion of teaching and non- 
teaching school staff. Due to the sampling strategy employed, the 
respondents were not nested within schools, and there was very 
little nesting within districts; therefore, multilevel analysis was not 
warranted (Raudenbush & Bryk, 2002). Additional school-level 
factors and student perspectives not assessed in the current study 
(e.g., school climate) could provide a more comprehensive view of 
staff perceptions and responses to bullying and possibly allow for 
multilevel analyses (O’Brennan, Bradshaw, & Furlong, in press). 
There may also be differences in perceptions among the subpopu- 
lations of ESPs (e.g., transportation workers and paraprofession- 
als), which will be investigated in future studies. Although we 
adjusted for region and several school characteristics, it is possible 
that district- or state-level factors (e.g., policies and laws regarding 
bullying) not examined in this study may have influenced the 
pattern of findings. This is another area for investigation in future 
studies. 


Conclusions and Implications for Educational 
Research 


These findings also highlight the importance of connectedness 
particularly in relation to intervening in bullying situations involv-
ing special populations. Specifically, the findings suggest that the
same factors associated with intervening in general bullying situ- 
ations were relevant for bullying involving special populations, 
with the exception of receiving training on bullying policies, which 
was relevant to bullying among special populations but not for 
bullying in general. This suggests that bullying prevention policies 
may signal particular relevance for bullying targeting particularly 
sensitive populations. 

The results also suggest that connectedness may be an important 
target for bullying prevention programming and climate promoting 
efforts. It is likely that connectedness, specifically connection to 
students, colleagues, and the larger school community, increases 
staff members’ willingness to intervene in bullying situations. 
Connectedness-promoting activities may enhance staff's dedica- 
tion toward making the school community a positive and safe 
atmosphere for their colleagues and students and, in turn, increase 
their empathy for youth involved in bullying. Connectedness ef- 
forts also have the potential of increasing job satisfaction and 
retention, which is a major concern given the high rate of turnover 
in the field of education (Boe, Cook, & Sunderland, 2008). Recent 
national data suggest that 10% of public school teachers leave the 
profession after 1 year, and an additional 12% leave after 2 years 
of teaching (Kaiser, 2011). Moreover, teacher attrition hinders the 
overall social milieu of the school and limits the fidelity of pro- 
gram implementation from year-to-year (Borman & Dowling, 
2008; National Commission on Teaching and America’s Future, 
2007). However, if schools are able to foster support and trust 
among staff members, they are more likely to reduce rates of 
bullying and implement programs with efficacy. Although not 
examined in the current study, it is likely that these associations
generalize to other staff activities, such as efforts to improve
conditions for learning as well as quality implementation of pre- 
vention programs (Domitrovich et al., 2008). 

It is important to remember, however, that the school climate 
improvement process is slow and likely requires a change in norms 
and behavior (Bradshaw et al., 2009). Identifying factors associ- 
ated with positive schoolwide climate and behavioral changes can 
inform the development of programs and policies related to school 
safety and bullying prevention. Therefore, identifying school con- 
textual factors associated with behavior and school climate change 
would greatly inform the bullying prevention literature. Contextual 
factors, such as the school’s organizational climate or the level of 
disorder within the school or classroom environment, may also 
influence the way in which teachers manage bullying and other 
discipline problems, participate in whole-school prevention ef- 
forts, or refer students to school-based services (e.g., counseling; 
Domitrovich et al., 2008). Further work is needed to determine 
efficient ways to assess readiness and help schools move toward 
fidelity and positive student outcomes, and toward strategies for
program sustainability.


This study addressed whether prior successes with educational interventions grounded in the theory of
successful intelligence could be replicated on a larger scale as the primary basis for instruction in language 
arts, mathematics, and science. A total of 7,702 4th-grade students in the United States, drawn from 223 
elementary school classrooms in 113 schools in 35 towns (14 school districts) located in 9 states, participated 
in the program. Students were assigned, by classroom, to receive units of instruction that were based either 
upon the theory of successful intelligence (SI; analytical, creative, and practical instruction) or upon teaching 
as usual (weak control), memory instruction (strong control), or critical-thinking instruction (strong control). 
The amount of instruction was the same across groups. In the 23 comparisons across 10 content units in 3 
academic domains, there were only a small number of instances in which students in the SI instructional 
groups generally performed statistically better than students in other conditions. There were even fewer 
instances where the different control conditions outperformed the SI students. Implications for the future of
SI theory and the scalability of research efforts in general are discussed.


Throughout the first decade of the 21st century, educational re-
searchers and policymakers have placed an increased emphasis on the
twin goals of (a) using experimental designs to evaluate educational 
interventions and (b) gaining a greater understanding of the issues 
related to the scalability of educational interventions. The value
placed on interventions that have been experimentally tested is high-
lighted by repositories such as the U.S. Department of Education’s
“What Works” clearinghouse (http://ies.ed.gov/ncee/wwc/). Projects
related to issues of scalability were funded by the Department of 
Education in the early to mid-2000s, and the results of these projects
are beginning to receive increased attention in the empirical literature
(Constas & Sternberg, 2006; McKenna & Walpole, 2010). 

In the present article, we report on a large-scale empirical field 
study that also sought to address issues related to scalability. We 
examined whether applying the theory of successful intelligence to 
instruction and assessment in Grade 4 language arts, mathematics, 
and science would result in superior learning outcomes relative to 
alternative instructional methods, in particular, memory-based in- 
struction and critical-thinking based instruction (strong compari- 
son/control conditions) and teaching as usual—whatever it hap- 
pened to be (weak comparison/control condition). The study 
involved the participation of 7,702 fourth-grade students in 113 
elementary schools and 223 classrooms across the United States in 
35 towns (14 school districts) located in nine states (Alabama, 
California, Connecticut, Massachusetts, Minnesota, Kansas, North 
Carolina, South Carolina, and West Virginia), in order to deter- 
mine whether prior successes with the theory’s instructional ap- 
plication could be replicated at scale. 


Background 


There is evidence to suggest that teaching and assessment may 
be more effective when they are based in part on cognitive- 
psychological theories that have been applied to education (Brun- 
ing, Schraw, & Norby, 2010; Corno, Cronbach, Kupermintz, & 
Lohman, 2001). Certainly, this has been a major claim of research- 
ers as well as textbook authors in educational psychology (e.g., 
Ormrod, 2010; Slavin, 2008; Woolfolk, 2009). One such 
cognitive-psychological theory is the theory of successful intelli- 
gence. 

The theory (Sternberg, 1997, 2005, 2010) argues that successful 
intelligence is a person’s ability to achieve his or her goals in life, 
within his or her sociocultural context, by capitalizing on strengths 
and correcting or compensating for weaknesses, in order to adapt 
to, shape, and select environments through a combination of ana- 
lytical, creative, and practical skills (Sternberg, 2003b, 2009; 
Sternberg, Grigorenko, & Jarvin, 2007; Sternberg, Jarvin, & Gri- 
gorenko, 2009). Different students have different combinations of 
these skills. The theory is based on the notion that students learn 
in different ways and that they have different strengths in learning 
(Sternberg, Grigorenko, & Zhang, 2008a, 2008b), just as teachers 
have different strengths in teaching (Spear & Sternberg, 1987). 
Our goal is to assist teachers in balancing their teaching in such a 
way that each of the abilities can be addressed, exercised, and 
given a chance to develop (Sternberg & Grigorenko, 2000; Stern- 
berg et al., 2007, 2009). 

Teaching for analytical thinking means encouraging students 
to (a) analyze, (b) critique, (c) judge, (d) compare and contrast, 
(e) evaluate, or (f) assess. When teachers refer to teaching for 
“critical thinking,” some of them may mean teaching for ana- 
lytical thinking. Examples of exercises designed to develop 
such skills might ask students to (a) analyze a political speech, 
(b) critique a work of art, (c) judge the value of a social 
program, (d) compare and contrast two works of literature, (e) 
evaluate the conclusions drawn from a scientific experiment, or 
(f) assess the rationale for a cultural custom. 

Teaching for creative thinking means encouraging students to 
(a) create, (b) invent, (c) discover, (d) imagine if ..., (e) suppose
that ..., (f) predict ..., or (g) design. Teaching for creative
thinking requires teachers not only to support and encourage
creativity but also to role-model it and to reward it when it is 
displayed (Sternberg & Lubart, 1995; Sternberg & Williams, 
1996). Examples of such teaching activities might ask students to 
(a) create a work of art, (b) invent an alternative ending for a story 
they read, (c) discover the principle behind a natural phenomenon, 
(d) imagine what life would be like if global warming continued 
unabated, (e) suppose that they grew up alingual—having no 
language at all, (f) predict what will happen in the current civil war 
in Syria, or (g) design a psychological experiment to test a hy- 
pothesis about human behavior. 

Teaching for practical thinking means encouraging students to 
(a) apply, (b) use, (c) put into practice, (d) implement, (e) employ, 
or (f) persuade someone of something. Such teaching must relate 
to the real practical needs of the students, not what would be 
practical for individuals other than the students (Sternberg et al., 
2000). Examples might include asking students to (a) apply what 
they have read in a story to their life, (b) use their knowledge of 
mathematics to balance a checkbook, (c) put theory into practice in 
exercising defensive driving, (d) implement a plan for losing (or 
gaining) weight, (e) employ the rules of haiku and write one, or
(f) persuade someone that an argument is sound.


Measurement Research Support for the Theory of 
Successful Intelligence 


A number of different studies have been conducted that 
validate the premise of the theory of successful intelligence in 
the field of assessment and measurement. Here we present them 
only selectively and briefly. 

First, assessments based on the theory of successful intelli- 
gence appear to map onto skills that are relevant, broadly 
speaking, to success in life and various indicators of well-being 
(e.g., Grigorenko & Sternberg, 2001; Sternberg et al., 2000). 
Second, these assessments have demonstrated adequate psycho- 
metric properties (e.g., Kornilov, Tan, Elliott, Sternberg, & 
Grigorenko, 2012; Sternberg, Castejón, Prieto, Hautamäki, &
Grigorenko, 2001). Third, measurements of different kinds of 
skills (analytical, creative, and practical) can be done relatively 
independently of each other (e.g., Grigorenko et al., 2009). 
Fourth, successful-intelligence assessments can improve pre- 
diction of grade-point average as well as prediction of success 
in extracurricular and leadership activities; such assessments 
also can reduce ethnic-group differences in performance (Stern- 
berg, 2010; Sternberg, Bonney, Gabora, Karelitz, & Coffin, 
2010; Sternberg & The Rainbow Project Collaborators, 2006). 
Finally, as illustrated in Advanced Placement Psychology, Sta- 
tistics, and Physics tests, the inclusion of creative and practical 
assessments in addition to memory and analytical ones can 
reduce ethnic-group differences while increasing construct va- 
lidity (Stemler, Grigorenko, Jarvin, & Sternberg, 2006; Stemler, 
Sternberg, Grigorenko, Jarvin, & Sharpes, 2009). 

Thus, there is evidence that assessments based on the theory of 
successful intelligence can provide valuable concurrent and pre- 
dictive information about cognitive functioning at various stages 
of the life span and in various settings.


Instructional Research Support for the Theory of
Successful Intelligence 


A number of instructional studies have been conducted with 
students in different age groups and in various subjects to validate 
the relevance of the theory of successful intelligence in the class- 
room (for more detail and other research support for the theory, see 
Sternberg, 1985, 1997, 2003b; Sternberg, Jarvin, & Grigorenko, 
2011). Here we briefly exemplify two types of relevant studies: 
aptitude-treatment interaction (ATI) and main-effect studies of the
theory. 

An example of the ATI approach is a study (Sternberg, Grigo- 
renko, Ferrari, & Clinkenbeard, 1999) in which the Sternberg 
Triarchic Abilities Test (STAT; Sternberg, 2003a) was used to 
assess analytical, creative, and practical skills through multiple- 
choice and essay items. The test was administered to 326 children 
around the United States and in some other countries who were 
identified by their schools as gifted by any standard whatsoever. 
Children were selected for a summer program in (college-level) 
psychology if they fell into one of five ability groupings: high 
analytical, high creative, high practical, high balanced (high in all 
three abilities), or low balanced (low in all three abilities). The 
high-school students (n = 199) who came to Yale were then 
divided into four instructional groups. Students in all four instruc- 
tional groups used the same introductory-psychology textbook, a 
preliminary version of Sternberg (1995), and listened to the same 
psychology lectures, by a Yale professor teaching the introduction 
to psychology course at Yale College. What differed among the 
four groups was the type of afternoon discussion section to which 
students were randomly assigned. They were assigned to an in- 
structional condition that emphasized memory, analytical, creative, 
or practical instruction. The discussion sessions were taught by 
qualified instructors with no particular training in, or commitment 
to, the theory of successful intelligence. Instructors were assigned 
to the instructional conditions at random and were required to use 
differential teaching approaches. The instructors were unaware of 
students’ patterns of abilities as revealed by the STAT. Consider 
examples of instruction. In the memory condition, the participants 
might be asked to recall the originator of a major theory of 
depression. In the analytical condition, they might be asked to 
compare and contrast two theories of depression. In the creative 
condition, they might be asked to formulate their own theory of 
depression. In the practical condition, they might be asked how 
they could use what they had learned about depression to help a 
friend who was depressed. Students in all four instructional con- 
ditions were evaluated in terms of their performance on home- 
work, a midterm exam, a final exam, and an independent project. 
Each type of work was evaluated for analytical, creative, and 
practical quality. Thus, all students were evaluated in exactly the 
same way. The results indicated the presence of an aptitude-
treatment interaction whereby students who were placed in instruc-
tional conditions that better matched their pattern of abilities
outperformed students who were mismatched. For all performance 
assessments combined, for better matched versus mismatched 
groups, Cohen’s ds were 0.343, 0.195, and 0.255 for analytical, 
creative, and practical, respectively. In other words, when students 
are taught at least some of the time in a way that fits how they 
think, they do better in school. These results suggest that the 
negative Cronbach and Snow (1977) results for aptitude-treatment
interactions may have been due to lack of theoretical basis for
instruction or of theoretical match between instruction and assess-
ment. Pashler, McDaniel, Rohrer, and Bjork (2008), however,
have argued that there is still only weak evidence for aptitude-
treatment interactions, and the interested reader can refer to Stern-
berg et al. (2008b) for an alternative point of view.

Subsequently, a main-effect study of the theory (Sternberg, 
Torff, & Grigorenko, 1998) examined the learning of social studies 
and science by third graders and eighth graders. The 225 third 
graders were students in a very low income neighborhood, and the 
142 eighth graders were students who were largely middle to upper 
middle class. Classroom teachers, and consequently their students, 
were assigned to one of three instructional conditions pseudo- 
randomly so as to balance the number of students and classrooms 
in each condition. In the first condition, they were taught the 
course that they would have learned had there been no intervention 
(i.e., the emphasis was on memory). In a second condition, teach- 
ing emphasized critical (analytical) thinking. In the third condition, 
students were taught in a way that emphasized a balance of 
analytical, creative, and practical thinking. All students’ perfor- 
mance was assessed for memory learning through multiple-choice 
assessments as well as for analytical, creative, and practical learn- 
ing through performance assessments. As expected, students in the 
successful-intelligence (analytical, creative, practical) condition on 
average outperformed the other students in terms of the perfor- 
mance assessments. In particular, third graders from the 
successful-intelligence instructional conditions did better in four 
out of four comparisons with the standard teaching condition 
(mean Cohen’s d = 1.082 for n = 4) and in three out of four 
comparisons with the critical thinking condition (mean Cohen’s 
d = 0.510 for n = 3). Eighth graders in the successful-intelligence 
condition did better in seven out of seven comparisons with the 
standard teaching condition (mean Cohen’s d = 0.842 for n = 7) 
and in three out of seven comparisons with the critical thinking 
condition (mean Cohen’s d = 1.332 for n = 3). One could argue 
that this result merely reflected the way they were taught. Never- 
theless, the result suggested that teaching for these kinds of think- 
ing succeeded. More important, however, was the result that chil- 
dren in the successful-intelligence condition outperformed the 
other children even on the multiple-choice memory tests (Cohen’s 
ds were 0.289 and 0.383, and 1.283 and 0.833 for standard and 
critical thinking instructional conditions in the third- and eighth- 
grader studies, respectively). In other words, even when the goal is 
simply to maximize children’s memory for information, teaching 
for successful intelligence is still superior. It enables children to 
capitalize on their strengths and to correct or to compensate for 
their weaknesses, allowing them to encode material in a variety of 
interesting ways. 

These results were extended to reading curricula at the middle- 
school and high-school levels (Grigorenko, Jarvin, & Sternberg, 
2002). To illustrate, at the middle-school level (n = 871), language 
arts were taught explicitly for successful intelligence. At the high- 
school level (n = 432), successful intelligence instruction for 
reading comprehension was infused into instruction in mathemat- 
ics, physical sciences, social sciences, English, history, foreign 
languages, and the arts. As in previous studies, each assignment 
contained analytical, creative, and practical tasks. At both middle- 
and high-school levels students who were taught for successful 
intelligence outperformed students who were taught in standard
ways (mean Cohen’s d = 0.483 for middle-school level and mean
Cohen’s d = 0.238 for high-school level). 
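
The effect sizes reported in this section are Cohen's d values, that is, standardized mean differences between two groups. As a rough sketch, and assuming the common pooled-standard-deviation form (the cited studies may have used a different variant), d could be computed as follows. The function name and the scores are hypothetical and are invented purely for illustration; they are not data from any of the studies.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treatment, comparison):
    """Standardized mean difference using the pooled sample SD.

    Assumes two independent groups with at least two scores each.
    """
    n1, n2 = len(treatment), len(comparison)
    s1, s2 = stdev(treatment), stdev(comparison)  # sample SDs (n - 1)
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(comparison)) / pooled_sd


# Hypothetical posttest scores, made up for this example only.
si_scores = [14, 16, 15, 18, 17]
standard_scores = [12, 15, 13, 16, 14]

d = cohens_d(si_scores, standard_scores)  # about 1.26 for these numbers
```

A positive d indicates that the first group scored higher on average; swapping the arguments flips the sign but not the magnitude.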

Ideally, schools might utilize a uniform broad-based, construct- 
valid, theoretical model in their instruction and assessment and 
even in admissions, where relevant (Sternberg, 2010). Of course, 
the model need not be the theory of successful intelligence. Cer- 
tainly, there are other models (Gardner, 1993, 2006; Mayer, 2011). 

The fundamental difference between the current study and the 
studies discussed above is its scope and specific characteristics. 
Unlike previous studies, which were framed as either development 
and narrowly focused efficacy evaluations (Sternberg et al., 1999) 
or efficacy and replication studies of the theory of successful 
intelligence in the classroom (Grigorenko et al., 2002; Sternberg et 
al., 1998), the present study was conceived of as a scaling-up 
(Sternberg et al., 2006), main-effect evaluation of the utility of the 
theory of successful intelligence in actual classrooms. 


Scaling up Educational Interventions 


Educational research is replete with studies of new and exciting 
interventions that have been shown to work in one particular 
context or another. One of the biggest challenges facing the field 
of educational research, however, is the search for effective inter- 
ventions (e.g., curricular) that yield similar effects across diverse 
contexts (Elmore, 1996). In other words, are there interventions 
that can be successfully scaled up? The concept of “upscaling” is 
derived from economic theories that are currently pervasive in 
discussions surrounding education reform in the United States. 
Specifically, the microeconomic concept of “economies of scale” 
suggests that certain work can be done more efficiently by increas- 
ing the size of operation (Folland, Goodman, & Stano, 2013). In 
light of such reasoning, several funding agencies, including the 
National Science Foundation, the Institute of Education Sci-
ences, and the National Institutes of Health, have been engaged in
funding research that has been demonstrated to work in more 
limited contexts in order to determine whether the results can be 
replicated on a broader scale (e.g., http://www.nsf.gov/pubs/2002/ 
nsf02062/nsf02062.pdf). Their support has funded research by 
several teams (e.g., Clements, 2005; Francis, 2011; Fuchs, 2004; 
Hurtig, 2004; Pane, 2007; Starkey, 2004) as well as the present 
study. 

In their book, Glennan, Bodily, Galegher, and Kerr (2004) 
comprehensively examined the lessons learned from 15 different 
curricular programs that attempted to go to scale. Generally speak- 
ing, the results of this and other research have found three major 
factors affecting successful scale up (Glennan et al., 2004). First, 
if the intervention is developed externally, by sources other than 
teachers themselves, it is often less costly for schools and districts 
(Nunnery, 1998). This is not to say that teachers have no input. 
Rather curricula are often co-constructed, with teachers deciding 
which components to emphasize. Ultimately, however, the easier 
and less costly it is to implement a design, the more likely it is to 
be adopted (Glennan et al., 2004). The successful intelligence 
intervention in the current study was developed externally, al- 
though evaluation input from teachers was central to the process 
(Randi & Jarvin, 2006). 

The second factor affecting the success of educational interven- 
tions is whether they involve whole-school reform or targeted 
reform, in which only some classrooms or student populations
receive the intervention. Some evidence suggests that when the
whole school is involved, there is greater buy-in across the board, 
which in turn leads to a greater likelihood of success. The current 
study represents not only an effort at scaling up but also a large- 
scale experimental study of different educational interventions. As 
such, it was neither a targeted reform, per se, nor a traditional 
whole-school reform. 

The third factor impacting the successful scale-up of educational 
reforms is whether they relate to structural changes, teacher knowl- 
edge, or curriculum content. Specifically, prior research has shown 
that structural changes (e.g., classroom size, student groupings, 
team teaching) tend to have smaller impacts on educational out- 
comes than teacher knowledge or curriculum content changes. 
Within the context of the current study, the focus was on teacher 
knowledge and curriculum content. 

Elias, Zins, Graczyk, and Weissberg (2003) have argued that 
there is “a need to better document the stories of educational 
innovation and scaling up efforts so that contextual details can 
enrich an understanding of what is required for success” (p. 303). 
The current study is aimed at not only understanding the factors 
associated with going to scale but also attempting to simultane- 
ously run a large-scale experimental study. 

With this work, we attempted to (a) explore whether a curricu- 
lum based on the theory of successful intelligence is effective 
when implemented under conditions that would be typical if a 
district were to implement it on its own (i.e., without special
support from the developer or research team)¹ across a variety of
circumstances (e.g., different student populations, different types 
of schools) and (b) provide an estimate of the robustness of the 
successful-intelligence instruction. In other words, the main ques- 
tion we sought to address was whether a curriculum based on the 
theory of successful intelligence would continue to be more effec-
tive than instruction relying mostly on memory and/or analytical
skills, when implemented on a large scale, with different types of
students, schools, and teachers. Notably, teacher training and on-
going support provided by the research team were much more
limited than in previous studies. 


Method 


Participants 


Given the scope of this study, our aim was to recruit schools 
representing a wide range of geographical locations (i.e., different 
states and different student populations: urban, suburban, and 
rural), ethnic-minority representation, and socioeconomic profiles. 
In total, 3,270 school districts across the United States were 
contacted about the program. The final sample included schools 
from 35 towns located in 11 counties of nine states (Alabama, 
California, Connecticut, Massachusetts, Minnesota, Kansas, North 
Carolina, South Carolina, and West Virginia). We worked in 14 
school districts represented by 113 elementary schools, 223 teach- 
ers, and 223 classrooms. We entered information on 7,702 student 
participants, and obtained usable data (i.e., complete pre- and
posttest data) from 7,574 students. Some students received more
than one unit of instruction, but the number of units administered
and the order of the units were not fixed and varied depending on
the fit between each school district’s prescribed content and the
topics covered in our units. Correspondingly, here we present the
analyses unit by unit, with the total number of observations at n =
10,845. All students were fourth graders.

¹ Of course, only agreeable teachers of classrooms in volunteering
schools within the district participated in each condition, and thus the
ideal (district implementation) is approximated to varying extents.

Parents and caregivers were informed of the instructional inter- 
vention being implemented in their children’s classrooms and that 
the intervention had been endorsed by the school’s district super- 
intendent, principal, and classroom teacher. To facilitate broad 
acceptance, we kept the information collected on individual stu- 
dents to a minimum. Demographic information was thus obtained 
at the school level. In total, 49.6% of the students in the schools
that participated were girls and 27.8% were underrepresented
minorities. A further breakdown of the distribution of demographic 
information across schools by condition is provided in Table 1 
(Table 5 provides additional demographic information as it relates 
to specific units). Study conditions (successful intelligence = SI, 
critical thinking = CT, memory = M, and teaching as usual = 
TAU-control) were randomly assigned to schools. Random assign- 
ment at the school level was chosen to avoid contamination within 
the same school building, in which teachers and students naturally
talk to each other and share learning materials. In larger districts 
equal numbers of schools were assigned to each condition, and in 
small districts, with fewer than three schools participating, the 
assignment was random. 
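
The school-level allocation described above can be illustrated with a short sketch. It is not the procedure the research team actually ran: the district and school names, the seed, and the function name assign_schools are hypothetical, and only the three instructional conditions are included. Within each district, schools are shuffled and conditions are dealt out in a randomly ordered cycle, so that a larger district receives equal (or near-equal) numbers of schools per condition, whereas a district with one or two schools simply receives a random draw.

```python
import random
from collections import Counter

CONDITIONS = ["SI", "CT", "M"]  # instructional conditions named in the text


def assign_schools(districts, seed=0):
    """Assign one condition per school, balancing conditions within each district.

    `districts` maps a district name to a list of school names (hypothetical data).
    """
    rng = random.Random(seed)
    assignment = {}
    for district in sorted(districts):
        schools = sorted(districts[district])
        rng.shuffle(schools)      # random order of schools within the district
        order = CONDITIONS[:]
        rng.shuffle(order)        # random order in which conditions are dealt
        for i, school in enumerate(schools):
            # Cycling through the conditions keeps the counts balanced.
            assignment[school] = order[i % len(order)]
    return assignment


# Hypothetical example: one larger district and one single-school district.
example = {
    "District A": ["A1", "A2", "A3", "A4", "A5", "A6"],
    "District B": ["B1"],
}
result = assign_schools(example, seed=42)
print(Counter(result[s] for s in example["District A"]))
```

Because every teacher in a school inherits the school's condition, the unit of randomization in this sketch is the school, which mirrors the rationale of avoiding contamination within a building.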

The guiding principles behind the critical thinking and memory 
conditions were drawn, respectively, from the education literature 
(for an introduction, see Halpern, 1996) and research on memory 
and mnemonic techniques (for an overview, see Baddeley, Ey- 
senck, & Anderson, 2009). As illustrated in Table 2, there is some 
overlap of activities across conditions, because critical thinking 
and analytical thinking in the theory of successful intelligence are 


Table 1
School-Based Demographic Information Across Intervention Conditions

                          Intervention condition
Variable    SI            CT            M             TAU-control
% female
  M         49.63         48.21         48.71         47.85
  SD        [illegible]   3.06          2.43
% Asian
  M         3.83          2556 [sic]    3.89          7.38
  SD        4.36          3.56          5.12
% Black
  M         20.39         [illegible]   28.46         11.38
  SD        15.40         14.11         27.01
% Hispanic
  M         8.97          15.50         9.58          10.77
  SD        12.28         29.79         [illegible]
% White
  M         66.80         71.23         58.06         70.46
  SD        22.24         28.74         29.13
No. schools 43            40            30            1
No. classes 100           65            55            3

Note. SI = successful intelligence; CT = critical thinking; M = memory;
TAU-control = teaching as usual control.


similar constructs, and because the corresponding condition in- 
cluded memory activities. The main difference between SI, CT, 
and M curricula, then, is that the first balances an array of activities 
whereas the latter two focus on one particular approach (CT or M). 
Overall, the three versions (SI, CT, and M) have comparable 
amounts of student activities and require the same duration of 
classroom time and student time on task to cover the content. Thus, 
in one case (SI) there is a mixture of different types of activities. 
In the two other conditions (CT and M) there are more CT and M 
activities, respectively, and the creative and practical activities that 
were present in SI are absent. 

The study’s material development and data collection phase 
took five years to complete.²


Materials 


Teaching units. Lesson materials (hereafter, units) were de- 
veloped for three academic domains (language arts, mathematics, 
and science) and for different content (e.g., within science there 
were units on ecology, electricity, light, and magnetism) in a 
similar manner, equalizing the engagement of targeted skills across 
the experimental conditions. Each unit was preceded and followed 
by unit-specific pre- and posttests. The content was based on a 
thorough review of the standards of each participating state at the 
time of the creation of the curriculum. We focused on those 
content topics that a majority of states suggested should be covered 
in their curriculum at the fourth-grade level. In some cases we 
selected a topic that was targeted in Grade 4 in one state but in 
Grade 3 in another state. There was never more than a one-year 
discrepancy between the topics, however, and, when present, the 
discrepancy did not influence participation in the study. 

The curricula in each of the three instructional treatment con- 
ditions (SI, CT, and M) were similar in that they covered the same 
concepts (e.g., magnetism), contained equal amounts of student 
activities, and required identical or comparable amounts of classroom
instruction and student engagement. They differed, however,
in the manner in which the concepts were approached, presenting
student activities that combined analytical, creative, and practical
approaches to learning (in the SI condition); or a majority of analytical
approaches (in the CT condition); or a majority of activities
encouraging memorization (in the M condition). Because the SI
instructional approach also engages students in critical thinking, 
there were some activities that were offered both in the SI curric- 
ulum and in the CT curriculum. The same holds true for the 
memory-based activities that were offered in all three instructional 
approaches. Table 2 provides an example of how the activities 
differed in the three instructional approaches: Students in the SI 
condition had an analytical activity and one practical activity, 
students in the CT condition had two analytical activities, and 


² Due to the magnitude and duration of the study, various preliminary
reports of the data were produced. These reports included different subsamples
of the study or presented analyses of the data in a variety of
different ways (e.g., year by year of the study), using different data-analytic
approaches, or with a variety of different software. Inevitably, there are
differences between the obtained results, although all of the previous
analyses have pointed to the advantage of the successful intelligence
condition. This is the first presentation of the whole sample, where the
analyses were carried out in the most conservative way, unit by unit, across
all years of the implementation, utilizing a single analytic framework.

Table 2
Illustration of How Units in the Three Instructional Conditions Covered the Same Content for Students but With Different
Instructional Approaches and Activities in the Language Arts Unit on Biography as a Literary Genre (1 Day)

Successful intelligence

Objectives
Students will be able to
(a) explain what a biography is,
(b) identify and interpret life events, given a biographical statement,
(c) compose (orally and in writing) one-sentence biographical statements.

Activities
• [Analytical activity]: Given a short biography, identify and categorize
  life events using a graphic organizer
• [Practical activity]: Write biographical statements about
  friend or family members

Skills
• Genre: Biography
• Description and interpretation of text
• Sentence writing

Critical thinking

Objectives
Students will be able to
(a) explain what a biography is,
(b) identify and interpret life events, given a biographical statement,
(c) compose (orally and in writing) one-sentence biographical statements.

Activities
• [Analytical activity]: Given a short biography, identify and categorize
  life events using a graphic organizer
• [Memory activity]: Write biographical statements about the
  subject of a biography

Skills
• Genre: Biography
• Description and interpretation of text
• Sentence writing

Memory

Objectives
Students will be able to
(a) define what a biography is,
(b) identify life events, given a biographical statement,
(c) compose (orally and in writing) one-sentence biographical statements.

Activities
• [Memory activity]: Given a short biography, recall life facts, using
  notes and a frame as memory aids
• [Memory activity]: Write biographical statements about
  the subject of a biography

Skills
• Genre: Biography
• Description of text
• Sentence writing


students in the M condition had two memory activities. Table 3 
provides the specific activity instructions for the classroom 
teacher. 

Language arts curriculum units. Five thematic language- 
arts units were completed by the students in each of the three 
conditions (SI, CT, and M). These five units were titled (a) How 
and Why Nature Tales (Wonders of Nature); (b) Informative 
Nonfiction (True Wonders); (c) Biography (Lively Biographies); 
(d) Quest Literature (Journeys); and (e) Mystery (It’s a Mystery).
Thus, in total there were 15 instructionally customized, newly
developed units (5 content units × 3 treatment conditions). Although
the content and duration of each unit were identical, within
each condition each unit was taught with different techniques
based on the SI, CT, and M specifications. Students across the
three conditions received the same unit-specific pre- and posttest
assessments. That is, there were five pre-posttest pairs corresponding
to the five content units.

Intended as an introductory unit, The Wonders of Nature intro- 
duced students to two short poems about nature, which served to 
motivate students to “wonder” about the natural phenomena ex- 
plained in pourquoi (“how and why”) tales. Students were taught 
to identify the characteristic elements of pourquoi tales, including 
the concept of cause and effect. As a culminating activity, students 
were expected to write their own pourquoi tale. 

In True Wonders, students learned library research skills. They 
were expected to develop an understanding of research methods, 
understand the difference between fiction and nonfiction, and learn 
to use reading strategies to synthesize information from nonfiction 
sources. 

In Lively Biographies, students were exposed to biography as a 
genre. They engaged in a series of activities that helped them to 
develop a working knowledge of the nature of the genre, the
sequencing of events in chronological order, and the use of graphic
organizers in the recording of events. Students then interviewed 
someone and produced a photo-biography. 

In Journeys, students were engaged in the reading of quest tales 
and, through a series of activities, gained an understanding of the 
elements of the quest tale. Students were expected to articulate 
universal themes, identify and articulate qualities of quest heroes, 
and demonstrate knowledge of the above through the writing 
process. 

Finally, in the It’s a Mystery unit, students listened to a read-aloud
mystery and at the same time independently read a mystery
of their choice. Through activities based on the readings, students 
gained an understanding of the mystery genre, including how 
suspense and intrigue are built. For example, students identified 
the setting, characters, plot development, conflict, and resolution; 
learned vocabulary common to the genre; discussed human expe- 
riences and motives; and followed clues to solve the mystery. 
Usable data were collected for all five units (see the Note on 
missing data section below). 

Mathematics curriculum units. Five mathematics units, including
pre- and postintervention assessments, were completed in
each of the four conditions (SI, CT, M, and TAU-control)³: (a)
Equivalent Fractions; (b) Measurement; (c) Geometry; (d) Data
Analysis and Representation; and (e) Number Sense and Place
Value. Thus, in total there were 20 customized instructional units


³ In our work with multiple districts and schools around the country, we
established, due to the wide range of content covered in the various curricula used
across the country, that the only domain in which we could implement a TAU
condition was Mathematics. The diversity of curricula, pedagogies, and standards
was too great in the domains of Language Arts and Science to justify a homogeneous
TAU condition.

Table 3
Detailed Descriptions of the Analytical, Practical, and Memory Activities Cited in Table 2

Analytical activity: Listening for life facts (biographical statement model)

After students demonstrate an understanding of biography, move on to an example of a biography.
Ask the students to listen to the biography and try to identify life facts, such as date and place of
birth, what the person looks like, or what the person accomplished. Write the biography on chart
paper so the children can follow along while you read.

Sample biography: Mrs. Murray was born in San Diego, California, and learned to swim almost
before she learned to walk. Her older brother Peter taught her to swim at the marina where their
dad worked as a lifeguard. As a youngster, Mrs. Murray liked to race her brother and the other
children who swam at the marina. She was a tall, athletic youngster who kept her long, blond hair
tied back in a ponytail. Her family was not at all surprised when she joined the high school swim
team and won many medals. Today when she is not teaching her fourth grade class, Mrs. Murray
still enjoys swimming and teaching her own children to swim at the local beach.

Then ask the children to share “life facts” they learned about the person from hearing the brief
biography. For example, they might share that Mrs. Murray is a good swimmer or that she was
born in California. You might want to point out that biographies are usually written in the third
person because they are about someone else's life story. As the children share what they can
remember about your life, write the “life fact” under the appropriate heading on the tag board
chart. Use the category labels to prompt the children to remember life facts they heard. Tell them
they can use the BIOgraphic organizer as a guide while they are reading.

Note to teacher: A classroom wall chart (a BIOgraphic organizer) can be made out of tag board or
flannel board for repeated use throughout the lesson. Ideally, it should be created as a pocket chart
so that students can post their sentence strips to sort life facts throughout this unit. Category
headings (e.g., accomplishments, appearance, family, friends, occupation) may be changed to fit
reading passages. A similar matrix will be used as an advance organizer throughout the unit to
assist students in reading biographies for life facts.

Practical/creative activity: Writing biographical statements about friends and family

After reading and discussing the model biography, tell the students they will finish a brief practical
activity in which they will become a biographer. Tell the students that they will be doing a short
activity in which they will select something memorable about a person they know well and write
one biographical statement about that person. Ask the students to think of someone they know
well. You may want to prompt the students to remember different aspects of the person’s life by
slowly asking them a series of “remember” questions. Tell the students to close their eyes and try
to remember what the person looks like, how old the person is, what the person wears, what the
person likes to do, where the person works or goes to school, what friends and family members
the person has, and what interesting or memorable things the person has done.

Then ask the students to select one interesting memory and write one statement about this person.
Students should write their sentences on a sentence strip/oak tag so that the sentences can be
saved and referred to in future lessons, as necessary. Classroom paraprofessionals may be
involved in helping the students write a complete sentence and/or checking for correct spelling
and punctuation. These sentences will serve as models of short biographical statements. They will
also serve as examples of “life facts” or the kinds of information a biography typically tells about
a person’s life.

Memory activity: Recall life facts

After students demonstrate an understanding of biography, move on to an example of a biography.
Ask the students to listen to the biography and take notes to memorize life facts, such as date and
place of birth, what the person looks like, or what the person accomplished. Write the biography
on chart paper so the children can follow along while you read.

Sample biography: Mrs. Murray was born in San Diego, California, and learned to swim almost
before she learned to walk. Her older brother Peter taught her to swim at the marina where their
dad worked as a lifeguard. As a youngster, Mrs. Murray liked to race her brother and the other
children who swam at the marina. She was a tall, athletic youngster who kept her long, blond hair
tied back in a ponytail. Her family was not at all surprised when she joined the high school swim
team and won many medals. Today when she is not teaching her fourth grade class, Mrs. Murray
still enjoys swimming and teaching her own children to swim at the local beach.

Remove the biography and ask students to recall the main life facts they just heard. Ask students to
review their notes, set them aside, and then recall the main facts about the person in the
biography.

Note. Text in italics is for the teacher.


(5 content units × 4 treatment conditions); within each condition,
each unit was taught using different techniques based on the SI,
CT, M, and TAU-control specifications. However, there were only
5 pre-posttest pairs, as students across the four conditions received
the same pre- and posttest assessments.

The Equivalent Fractions unit was intended as a follow-up to an
introductory fractions unit. In it students developed an understanding
of the concept of equivalence, modeled equivalent fractions
with concrete manipulatives, identified and generated equivalent
fractions (denominators less than 12), and applied the concept of
equivalent fractions in practical and problem-solving situations.

In the Measurement unit, students learned to measure quantities 
(including time, length, perimeter, area, weight, and volume) in 
everyday and problem situations. They compared, contrasted, and 
converted within systems of measurements (customary and metric) 
and estimated measurements in everyday and problem situations. 
In addition, students learned about the use of appropriate units and 
instruments for measurement.

In Geometry, students engaged in the identification and modeling of
simple two-dimensional and three-dimensional shapes and developed an 
understanding of their properties (reviewing perimeter, area, and volume). 
Students were expected to understand and identify geometric concepts 
such as “congruent,” “similar,” and “symmetric.” Finally, students com- 
bined, rotated, reflected, and translated shapes. 

In Data Analysis and Representation, students were given an 
opportunity to collect, organize, and display data from surveys, 
research, and classroom experiments. They used the concepts of 
range, median, and mode to describe a set of data and to interpret 
data in the form of charts, tables, tallies, and graphs. They learned 
about the use of bar graphs, pictographs, and line graphs and the 
advantages and disadvantages of each. 

In the Number Sense and Place Value unit, students used num- 
ber lines to identify and understand negative numbers and the 
ordering of numbers. They were led to an understanding of how to 
use the place-value structure of the Base 10 number system and 
how to identify factors and generate equivalent representations of 
numbers to use in problem solving. In addition, students explored 
even/odd numbers, square numbers, and prime numbers. 

Usable data were collected only from three of the units: Equiv- 
alent Fractions, Measurement, and Geometry (see the Note on 
missing data section below). 

Science curriculum units. Four science units, including pre- 
and postintervention assessments, were completed in each of three 
conditions (SI, CT, and M): (a) The Nature of Light; (b) Magnetism;
(c) Electricity; and (d) Ecology. In total, there were 12
customized instructional units (4 content units × 3 treatment
conditions); within each condition, each unit was taught with
different techniques based on the SI, CT, and M specifications.
There were only 4 pre-posttest pairs, as students across the three
conditions received the same pre- and posttests.
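
The unit counts reported for the three domains follow from multiplying the number of content units by the number of conditions in which each was taught. The short sketch below reproduces that arithmetic from the figures given in the text; the dictionary layout and the function name unit_versions are illustrative choices, not part of the study materials.

```python
# Content units and conditions per domain, as reported in the text.
DOMAINS = {
    "Language arts": (5, ["SI", "CT", "M"]),
    "Mathematics": (5, ["SI", "CT", "M", "TAU-control"]),
    "Science": (4, ["SI", "CT", "M"]),
}


def unit_versions(content_units, conditions):
    """Number of customized unit versions: content units x conditions."""
    return content_units * len(conditions)


versions = {d: unit_versions(n, conds) for d, (n, conds) in DOMAINS.items()}
# One pre-posttest pair per content unit, shared across conditions.
test_pairs = {d: n for d, (n, _) in DOMAINS.items()}

print(versions)
print(sum(versions.values()), sum(test_pairs.values()))
```

The number of pre-posttest pairs depends only on the number of content units, because all conditions within a domain shared the same assessments.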

The Nature of Light unit introduced the concepts of light, 
reflection, and refraction. By the end of this unit, students were 
able to show that light travels in straight lines; give examples 
illustrating that visible light is made of different colors; list colors 
of visible light; explain how a prism can separate visible light into 
different colors; explain how mirrors can be used to reflect light; 
give examples of absorption; describe and give examples of re- 
flection; give examples and describe refraction; and describe the 
similarities and differences between absorption, reflection, and 
refraction. 

In the Magnetism unit, students learned the properties and uses 
of magnets. By the end of this unit, students were able to explain 
the difference between magnetic and nonmagnetic objects; give 
examples of magnetic and nonmagnetic objects; define magne- 
tism; predict whether two magnets will attract or repel each other; 
describe the effects of a magnet on a compass; explain the differ- 
ence between temporary and permanent magnets; define the terms 
lodestone and keeper as they apply to magnetism; illustrate that the 
magnetic force is strongest at the poles; and identify materials that 
may interfere with a magnetic field. 

In the Electricity unit, students were engaged in hands-on ac- 
tivities relating to electrical circuits. By the end of this unit, 
students were able to explain that static electricity occurs when 
charges are moved from one object to another; give examples of 
static electricity; explain how an object can become charged; 
define what a cell is; explain the relationship between a cell and a 
battery; explain what current electricity is; list the essential com- 


ponents of a series circuit; explain how a series circuit works; 
explain how a parallel circuit works; explain the difference be- 
tween a series circuit and a parallel circuit; explain what conduc- 
tors are; explain what insulators are; and give examples of insu- 
lators. 

In the Ecology unit, students were provided with a basic under- 
standing of the interdependence of organisms and their environ- 
ments through a series of activities focusing on environmental 
factors and their impact on animals and people, respectively, and 
the interdependence of plants and animals. In addition, students 
developed the skills necessary to conduct scientific investigations 
and gain an appreciation for science as a discipline. By the end of 
the unit, students were able to explain what a terrarium is; describe 
some environmental factors that are important to the growth and 
survival of plants and animals; give examples of how environmental
factors affect the growth and survival of plants; explain how
animals depend on the nonliving environment to survive; describe
some environmental factors that affect animals’ ability to survive 
and grow; give examples of the effect of the same environmental 
factor on different animals; describe an ecosystem; explain some 
of the relationships between plants, animals, and the physical 
environment; explain how energy passes through an ecosystem; 
describe the conditions that are necessary for an ecosystem to 
function; explain how people depend on their environment; give 
examples of how people can have a positive or negative effect on 
their environment; and understand why it is important to use 
natural resources wisely. Only two units, The Nature of Light and 
Magnetism, produced usable data (see the Note on Missing Data 
section below). 

Assessments. Unit-specific assessments were developed to 
capture mastery in the content area of each unit but were generated 
in such a way that equal numbers of items tapped into the four key 
abilities at which the intervention conditions were aimed—that is, 
memory, analytical, creative, and practical abilities (Randi & 
Jarvin, 2006). Each pre- and posttest had 20 to 22 items. In order to
equalize test difficulty statistically and place pre- and posttest 
scores on the same measurement scales, we included 3-7 items 
that were common to both pre- and posttest in each unit. These 
items were used to obtain ability scores (see below). 

Initial rubrics were developed by the research team for all of the 
units’ pre- and postintervention assessments; they were then re- 
fined in collaboration with several raters once initial student data 
had been collected. All the student data were then rated with the 
final rubrics. The items were roughly equally divided between 
multiple-choice (scored 0-1) and open-ended (scored 0-5) for- 
mats, with 40% to 59% identified as multiple-choice items, de- 
pending on the test. Students in all conditions received identical, 
unit-specific pre- and posttests. Table 4 presents the Cronbach’s 
alpha internal consistency reliability estimates, for pre- and post- 
tests, for both multiple-choice and open-ended questions simulta- 
neously (Rizopoulos, 2006). 
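
For readers unfamiliar with the statistic, Cronbach's alpha for a test of k items can be computed from the item variances and the variance of the total score. The sketch below uses the standard textbook formula with made-up scores; it is not the software the authors used, and the function name cronbach_alpha and the example data are assumptions for illustration. The example mixes a multiple-choice item (scored 0-1) with an open-ended item (scored 0-5), as in the tests described above.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-student score rows (one value per item).

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical scores for four students on two items:
# item 1 is multiple choice (0-1), item 2 is open-ended (0-5).
example_scores = [[1, 5], [0, 0], [1, 3], [0, 2]]
alpha = cronbach_alpha(example_scores)
print(round(alpha, 3))
```

Because open-ended items have a wider score range than multiple-choice items, they contribute more to the total-score variance in this raw-score version of the formula.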


Procedures 


Assignment to experimental groups. Recruitment efforts 
were targeted at school districts rather than at individual schools, and 
we sought permission and buy-in from district superintendents before 
reaching out to principals. Depending on the size of the district and the 
number of schools judged by the superintendent to be candidates for

Table 4
Internal Consistency (α) and Construct Reliability (Con. r) of Curriculum Unit Pretests and Posttests

                                                Pretest                      Posttest                     Common
Curriculum units                                Items        α      Con. r   Items        α      Con. r   items        n
Language Arts
  How and Why Nature Tales (Wonders of Nature)  22           0.767  0.991    22           0.826  0.995    6            1,626
  Informative Nonfiction (True Wonders)         22           0.786  0.993    [illegible]  0.793  0.995    4            1,233
  Biography (Lively Biographies)                22           0.845  0.992    22           0.778  0.990    7            752
  Quest Literature (Journeys)                   22           0.783  0.990    [illegible]  0.832  0.991    3            520
  Mystery (It’s a Mystery)                      22           0.813  0.992    22           0.803  0.988    [illegible]  549
Mathematics
  Equivalent Fractions                          [illegible]  0.816  0.997    22           0.748  0.994    5            1,735
  Measurement                                   22           0.698  0.992    20           0.739  0.993    3            1,550
  Geometry                                      [illegible]  0.659  0.990    22           0.775  0.992    3            545
Science
  The Nature of Light                           20           0.876  0.991    20           0.848  0.980    6            1,328
  Magnetism                                     20           0.762  0.986    20           0.646  0.982    2            [illegible]

Note. Con. r = construct reliability of factors jointly estimated with common item anchoring.



participation, one or more experimental conditions were implemented 
in the district. In all cases, teachers within a given school were 
assigned the same condition to avoid within-building contamination.
In other words, in small districts there might be only one participating 
school, so that the district is confounded with the experimental con- 
dition, whereas in a larger district, all conditions might be assigned, 
always to different schools. Within these constraints, the allocation to 
experimental condition was random. This design reflects the chal- 
lenges and constraints of large-scale implementation in diverse 
settings, where districts and schools need to have voices in making 
decisions about the experimental interventions they are interested 
in considering. In other words, administrators decided if the dis- 
trict should participate, and if so, which schools should be in- 
volved, but they did not select the experimental condition(s) to be 
implemented. Although it is difficult to ascertain the full impact of 
the final allocation of schools to condition, our analyses include 
pretest scores as a covariate. This is in part to address concerns that 
even perfectly random allocation does not ensure a balance of 
student attributes across conditions. Yet another challenge was to 
get all of the participating teachers to implement all of the instruc- 
tional units. Although upon recruitment districts committed to 
working with the whole curriculum (i.e., all units), the delivery of 
the full curriculum across all participating schools proved impos- 
sible due to differences between schools in terms of the content 
that they wanted to prioritize at the given grade level, scheduling 
issues due to local tests and other required activities, as well as 
differences among classrooms in terms of student level and speed 
of progression through instructional materials. 

Teacher training. A 2-day, 12-hour in-service training pro- 
gram was developed and implemented by members of the research 
team for all the participating teachers. The workshop was tailored 
to the experimental condition that the participating teachers had 
been assigned to (i.e., SI, CT, or M). 

Day 1 focused on (a) the program design, teacher requirements, 
and other logistics and (b) the theoretical principles of teaching 
and instruction for each one of the three experimental conditions. 
After introductions, teachers were presented with a program over- 
view and the timeline and expectations for participation were 
reviewed and discussed as a group. The researchers then presented 


the theoretical underpinnings and prior empirical evidence for the 
effectiveness of the approach (SI, CT, or M). Teachers in the SI 
condition thus learned about the previous studies on the effective- 
ness presented in the introduction to this article; teachers in the CT 
condition were given examples of critical-thinking-based instructional
interventions; and teachers in the M condition were taught
about the effectiveness of different mnemonic strategies for learn- 
ing material. In addition to learning about earlier work, partici- 
pants got to practice activities that had proven successful. Again, 
specific activities practiced varied between the SI, CT, and M 
groups. Finally, teachers practiced hands-on use of the CORE 
system. CORE (Collaborative Online Research Environment) is a 
software package that was designed specifically for this program 
to allow teachers to access, download, and print curriculum ma- 
terials, as well as to provide a discussion board allowing them to 
chat both with other teachers enrolled in the same condition (SI, 
CT, or M) and with the curriculum developers and content spe- 
cialists involved in the program. 

Day 2 focused on modeling the units in each subject area and 
provided teachers with an opportunity for hands-on experience 
with the unit format. Materials distributed included a teacher guide 
containing instructional material, background information, re- 
source materials reflecting print and nonprint sources, and student 
workbooks. Teachers received only materials relevant to the in- 
structional approach they were to implement in their classroom. In 
other words, a teacher trained to implement the SI instructional 
approach was trained with other teachers implementing the SI 
approach and saw only the SI instructional materials. Teachers also 
were introduced to the instructional strategies particular to the 
condition. An overview of the pre- and postintervention assess- 
ments concluded the sessions. 

Fidelity monitoring. Fidelity monitoring was carried out in 
two ways: through the CORE system and by collecting and 
reviewing all student workbooks to track the level of comple- 
tion. As mentioned above, the CORE system is a Java-based 
collaborative environment designed to establish and promote 
long-distance collaborations with teachers. It was designed, 
created, and maintained by the Yale University Information 
Technology Department for the purposes of this program and
enabled the research team to stay in touch with implementing
teachers throughout the school year. Because all electronic 
conversations between teachers and between teachers and 
research-team members were recorded and stored, the system 
provided data to measure fidelity of implementation. A second 
measure was provided by the collected student workbooks, 
which contained information on which part(s) of a curriculum 
unit and what activities had been completed by the students in 
a given classroom. Both teacher logs and student workbooks 
were analyzed for indicators of fidelity; only those teachers 
whose students completed all homework assignments, and 
whose CORE logs were indicative of both understanding of and 
adherence to the program, were included in the data analyses. 
We did not have reason to expect (and did not observe) any 
differences in the usage of the CORE system and in the utili- 
zation of the workbooks across instructional treatments (SI, CT, 
and M). In other words, there were differences across class- 
rooms, with some but not other teachers utilizing the CORE 
system regularly and some teachers returning student work- 
books where every activity had been completed and other 
teachers returning student workbooks where entire sections 
were blank, but these differences were observed within each 
instructional treatment condition. Student workbooks were used 
as indicators that permitted a participating classroom to be 
entered in the study database. If the workbook contained less 
than 70% of the activities completed, the data from a given 
teacher were not entered into the database. Altogether, ~10% 
of the participating classrooms in each instructional condition 
did not meet this criterion. 
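
The workbook inclusion rule described above can be sketched in a few lines. This is a minimal illustration, not the procedure actually used by the research team: the class name `ClassroomWorkbooks`, its fields, and the function names are hypothetical, and the qualitative review of CORE logs is not modeled. Only the 70% completion threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class ClassroomWorkbooks:
    """Hypothetical record of workbook completion for one teacher's classroom."""
    teacher_id: str
    condition: str            # "SI", "CT", or "M"
    activities_completed: int
    activities_total: int


def meets_workbook_criterion(room, threshold=0.70):
    """Return True when at least `threshold` of the activities were completed."""
    if room.activities_total <= 0:
        return False  # no activities recorded, so nothing can be verified
    return room.activities_completed / room.activities_total >= threshold


def retained_classrooms(rooms, threshold=0.70):
    """Keep only classrooms whose workbooks meet the completion threshold."""
    return [room for room in rooms if meets_workbook_criterion(room, threshold)]
```

A classroom with 8 of 10 activities completed would be retained under this sketch, and one with 6 of 10 would not.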

Data processing. All data processing was carried out at the 
Center for the Psychology of Abilities, Competencies, and 
Expertise (PACE Center) at Yale University. Details regarding 
the management of the data can be found in the Appendix. 
Close to one hundred casual employees were hired in addition 
to permanent research-assistant staff to assist with data entry 
(multiple-choice questions) and coding of open-ended ques- 
tions. The open-ended questions were coded with a detailed 
rubric developed by the curriculum developer, and coders were 
trained to reach satisfactory interrater reliability levels (i.e., the 
correlations between the pair’s open-ended item ratings had to 
be greater than .70) before they were allowed to start coding 
materials. 
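
The coder certification check can be illustrated as follows. This is a sketch under stated assumptions: the text specifies only that the correlation between a pair's ratings had to exceed .70, so the use of a Pearson correlation and the function names `pearson_r` and `pair_is_certified` are assumptions made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two rating lists of equal length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return sxy / math.sqrt(sxx * syy)


def pair_is_certified(ratings_a, ratings_b, cutoff=0.70):
    """True when the pair's ratings correlate above the cutoff."""
    return pearson_r(ratings_a, ratings_b) > cutoff
```

Under this sketch, two coders who assign identical rubric scores to a training set are certified, whereas coders whose scores run in opposite directions are not.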


Statistical Analyses 


Note on missing data. As we worked with a large number of 
schools and districts, we could exercise only limited control over 
what and how many units were selected by teachers to be admin- 
istered. Buy-in required a commitment to the whole program, but 
teachers needed to map their preferences for particular units onto 
their school calendars and other administrative demands. In turn, 
to include a unit into the analyses, we had to have a reasonable 
number of students receiving the unit across all study conditions. 
Unfortunately, this did not happen for two Mathematics and two 
Science units. Due to small or distinctly uneven distributions of the 
number of participants across conditions within certain units, the 
corresponding data were not analyzed for those four units. 

Attrition. Extending our reporting of fidelity monitoring, a 
certain degree of student attrition is also expected as students come 
and go throughout the school year due to illness and the like. Some 
students may also not be available for testing at one or the other 
assessment or may have joined the class part way through the 
training. An analysis of attrition revealed statistical differences in 
six of the nine units,⁴ although effect sizes (η²) are small with no 
consistent pattern for any one condition. There was statistically 
less attrition in the SI condition for three units (η²: Equivalent 
Fractions = .005; Measurement = .006; and Magnetism = .047), 
less attrition in the M and CT conditions for three units (η²: True 
Wonders = .042; The Wonders of Nature = .015; and Mysteries = 
.028), and no statistical differences in attrition for the remaining 
three units. Of importance, these differences were not related to the 
intervention differences to be reported shortly. Only students who 
were available for assessment at both time points were included in 
the analyses. 
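
To make the effect-size metric concrete, the sketch below computes eta squared as the share of total variance in an attrition indicator that lies between conditions. The text does not state how the reported values were computed, so the one-way decomposition, the 0/1 coding of attrition, and the function name `eta_squared` are all assumptions made for illustration.

```python
def eta_squared(groups):
    """Eta squared = SS_between / SS_total for a one-way layout.

    `groups` maps a condition label to a list of 0/1 attrition indicators
    (1 = student missing at pretest or posttest). Empty groups are ignored.
    """
    groups = {name: values for name, values in groups.items() if values}
    all_values = [v for values in groups.values() for v in values]
    if not all_values:
        raise ValueError("no observations supplied")
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    if ss_total == 0:
        return 0.0  # no variation at all, so condition explains nothing
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total
```

Values near zero, such as those reported above, indicate that condition membership accounts for very little of the variation in attrition.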

Overview of analyses. The analyses we report here were 
conducted in two stages. First, we derived performance measures 
for each unit. Second, we ran unit-specific analyses that included 
a set of covariate and interaction terms.⁵ The rationale and general 
approach for these are described next. 

Derivation of performance measures. To combine multiple- 
choice (binary) and open-ended (ordinal) items into a single ability 
score, we used Samejima’s graded response model (Samejima, 
1997), as implemented in Mplus (Muthén & Muthén, 2005), for 
both pre- and posttest data (such scores have a range of approxi- 
mately −3 to 3). For the overlapping items that were presented 
both at pre- and posttest, their loading and threshold (i.e., their 
discrimination and difficulty parameters) were constrained. This 
allowed for the statistical equating of pre- and posttest item diffi- 
culty. As recommended in the literature (Geiser, Eid, Nussbeck, 
Courvoisier, & Cole, 2010), scores were calculated for only those 
individuals with both pre- and posttest data. We do not elaborate 
on these analyses here; however, details can be obtained from the 
authors. Traditional internal consistency measures for the pre- and 
posttest assessments of each unit are provided in Table 4, along 
with construct reliability estimates (Gefen, Straub, & Boudreau, 
2000). 
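
The way a graded response model places binary and ordinal items on one scale can be sketched as follows. This is the textbook logistic form with a discrimination parameter and ordered difficulty thresholds; it is not the Mplus implementation used in the study, and the function name `grm_category_probs` and the parameter values in the usage note are hypothetical.

```python
import math


def grm_category_probs(theta, discrimination, thresholds):
    """Category probabilities under a logistic graded response model.

    Returns P(response = k | theta) for k = 0 .. len(thresholds).
    A binary (multiple-choice) item is the special case of one threshold.
    """
    if discrimination <= 0:
        raise ValueError("discrimination must be positive")
    if any(hi <= lo for lo, hi in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    # Cumulative curves P(response >= k); P(>= 0) is 1 and P(>= K + 1) is 0.
    cumulative = [1.0]
    cumulative += [
        1.0 / (1.0 + math.exp(-discrimination * (theta - b))) for b in thresholds
    ]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]
```

For example, `grm_category_probs(0.0, 1.2, [0.0])` describes a binary item whose difficulty equals the student's ability, so both responses are equally likely. Constraining the discrimination and thresholds of an overlapping item to be equal at pretest and posttest, as described above, amounts to passing the same parameter values at both time points.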

Unit-specific analyses with covariates. As we have de- 
scribed, the students who participated in the current study were 
sampled from a large and diverse population. One of the touted 
benefits of a cognitive approach to educational interventions is the 
real possibility of capturing a much broader and diverse range of 
approaches to learning. This has certainly been our general expe- 
rience in the smaller scaled applications of the theory of successful 
intelligence. One difficulty we faced in the current study is that 
student-level diversity (e.g., gender and ethnicity) was not col- 
lected for reasons described previously. We attempt to capture this 
diversity and the differential extent that it may impact performance 
across condition by using a number of school- and classroom-level 
covariates. The diversity we are capturing is thus in terms of the 
educational environment, not the child’s specific circumstances. 


⁴ Attrition here is defined as data not available at either pretest or 
posttest. 

⁵ These analyses are the culmination of a comprehensive series of 
analytics conducted in a number of passes across this large database. We 
acknowledge the reviewers’ significant input in shaping the final set we 
report here. 
Following the derivation of measures for each educational unit (5 
Language Arts, 3 Mathematics, and 2 Science units), a series of 
mixed-effects (multilevel) regressions was fit to estimate the effect 
of intervention condition on the posttest performance. The pretest 
was always included as a covariate in the regressions. To evaluate 
the robustness of the obtained results, we repeated the analyses 
using, inter alia, alternative centering (group mean), a different 
random clustering variable (school rather than teacher), and the 
propensity scores approach to match the experimental groups as 
closely as possible (Dehejia & Wahba, 2002; Ho, Imai, King, & 
Stuart, 2007). Although there was some variability in the findings 
(i.e., the magnitude of effects), the pattern of results was generally 
consistent.⁶ The approach we used for the analyses reported here 
is as follows: There were two levels in the multilevel analysis: 
students at Level 1 clustered within classroom teachers at Level 2. 
That is, random effects (covariates and intervention conditions) 
were estimated at the teacher level (Level 2) to account for 
classroom level clustering. Students’ posttest and pretest perfor- 
mances were modeled at Level 1. Where statistically possible, all 
models included critical classroom- and school-level demographic 
variables and their interaction with experimental condition. Title I 
status,⁷ gender (defined as the proportion of the school population 
that was male; i.e., % male) and % White (proportion of the school 
population that was White) were school-level variables, and gift- 
edness (whether the class was identified as a regular or gifted- 
education classroom) was a classroom-level variable. The % male 
and % White variables were grand-mean centered for entry alone 
and as part of interaction terms. It is conceivable to introduce 
school variability as a third level in the model by clustering 
classrooms within schools. However, the distribution of the num- 
ber of classes across schools and intervention conditions was quite 
broad—on average there were only 1.63 classrooms per school 
(standard deviation = .45). This suggested to us (and was sup- 
ported by our preliminary analyses) that the school-level variables 
would provide little additional statistical information (in relation to 
their association with student performance) if they were modeled 
at the school level, rather than at the classroom/teacher level. 
Furthermore, given the limited variability in number of classrooms 
per school, a three-level model would be unstable. As such, the 
decision was made to stay with the simpler two-level model. The 
regression models were fit in R with the nonlinear mixed effects 
models (NLME) package (Pinheiro, Bates, DebRoy, Sarkar, & the 
R Core team, 2009). Note that the NLME package accommodates 
both linear and nonlinear models; however, in the present study 
only linear models were run. We treated intervention conditions as 
multiple, dummy coded variables (with SI as the reference group) 
in the analyses for each unit. In one or more conditions of some 
units, covariates were constants or zero. They were excluded from 
analyses when this occurred. 
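
The predictor coding described above (dummy variables with SI as the reference group, grand-mean centering of % male and % White, and condition-by-covariate interactions) can be sketched as follows. This is an illustration only: the models themselves were fit in R, and the function names, the dictionary keys, and the choice to center across the supplied classroom records are assumptions rather than details taken from the study.

```python
NON_REFERENCE_CONDITIONS = ("CT", "M", "TAU")  # SI is the reference group


def grand_mean_center(values):
    """Subtract the overall mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dummy_code(condition, reference="SI"):
    """One 0/1 indicator per non-reference condition; SI maps to all zeros."""
    if condition != reference and condition not in NON_REFERENCE_CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    return {f"is_{level}": int(condition == level) for level in NON_REFERENCE_CONDITIONS}


def build_design_rows(classrooms):
    """Build predictor rows from dicts with keys condition, pct_male, pct_white."""
    male_c = grand_mean_center([c["pct_male"] for c in classrooms])
    white_c = grand_mean_center([c["pct_white"] for c in classrooms])
    rows = []
    for classroom, male, white in zip(classrooms, male_c, white_c):
        row = dummy_code(classroom["condition"])
        indicators = list(row.items())  # snapshot before adding more keys
        row["pct_male_c"] = male
        row["pct_white_c"] = white
        for name, flag in indicators:
            # Interaction terms: condition indicator times centered covariate.
            row[f"{name}:pct_male_c"] = flag * male
            row[f"{name}:pct_white_c"] = flag * white
        rows.append(row)
    return rows
```

With this coding, each condition coefficient is read as the difference from SI at the average value of the centered covariates, which matches the interpretation of negative coefficients as favoring SI.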


Results 


Sample Data 


Table 5 presents descriptive statistics of the unadjusted pre- 
test and posttest performance scores, and the characteristics of 
the sample by study condition. Of note is the large variability in 
sample characteristics among the different conditions and dif- 
ferent units. This reflects the realities of conducting research 
during real-time classroom teaching using intact classrooms. To 
control statistically for this variability, we fit regressions sep- 
arately for each unit and included pretest as a Level 1 covariate 
and demographic variables (Title I, % male, % White, and 
giftedness) as Level 2 covariates. Interaction terms were also 
entered when possible to capture (in part) variability in the 
differential functioning of covariates between conditions. All 
models were run with the same set of covariates first, and for 
those models that would not statistically converge with all 
covariates, the models were modified. Covariates not able to be 
included for a particular model are represented with a dash in 
Table 6. Regressions were fit with varying intercepts and were 
grand-mean centered (Title I and giftedness indicators were not 
centered because these are binary variables). The analyses, 
which included the intervention condition coded into multiple 
dummy-variables with SI as the reference group (i.e., CT vs. SI, 
and M vs. SI, and, in addition for Mathematics units, TAU vs. 
SI), revealed the following results. First, all unit analyses 
included the student-level pretest score as a covariate, and in all 
cases, as would be expected, it was a statistically significant 
predictor of posttest performance. We report unstandardized 
regression coefficients in Table 5 and the graphical represen- 
tation of these data in Figure 1 (along with 95% confidence 
intervals). Below is a summary of the results for each academic 
domain. 


Units 


Language arts units. There were five language arts units 
that had analyzable data. Three of the five had a statistically 
significant effect for intervention condition. Controlling for 
student pretest score and school-level covariates (gender, % 
White, and Title I, and their interaction with condition) there 
was a statistically significant advantage to the SI condition over 
the CT condition in Wonders of Nature (b = −0.86, p = .05) 
and Journeys (b = −0.29, p = .02). CT was superior to SI in 
Mysteries (b = 0.81, p = .01). There were no statistically 
significant intervention effects for any of the other Language 
Arts units. 

Mathematics units. Three mathematics units had analyzable 
data. Two of the three had a statistically significant effect for 
intervention condition. Controlling for pretest performance and 
Level 2 covariates, statistically significant intervention effects 
were observed for Equivalent Fractions in favor of SI over 
TAU (b = −0.27, p = .01) and for Measurement in favor of 
Memory over the SI intervention (b = 0.28, p < .04). There 
were no statistically significant intervention effects for 
Geometry. 

Science units. There were two science units that had analyzable 
data, and both had a statistically significant effect for 
intervention condition. For The Nature of Light unit, there was 
a significant intervention effect in favor of SI over Memory 
(b = −0.78, p < .01). For Magnetism, there was a significant 


⁶ All of these results, as well as the details of the results presented in this 
article, are available from the authors upon request. 

⁷ We used Title I data (http://nces.ed.gov/) for each school as a proxy for 
socioeconomic status. 




Table 5 
Descriptive Characteristics of the Study Groups 

                     Test resultsᵃ                          Demographic characteristicsᵇ
                Pretest           Posttest
Condition         M       SD        M       SD        N  % girls  % White  Title I Giftedness

Language arts
How and Why Nature Tales (Wonders of Nature)
SI             0.01     0.87     0.54     1.35      703       SI     63.2     48.1       26.9
CT            −0.02     0.87     0.43     1.13      542     47.5     88.2     30.8       39.1
M              0.18     0.97    −0.02     1.30      436     49.1     63.6     88.1       24.5
Informative Nonfiction (True Wonders)
SI             0.03     1.03     0.11     0.84      519     50.2     73.2     31.4       34.9
CT            −0.08     0.69     0.02     0.63     S77)     47.8     88.2     29.2       34.0
M             −0.24     0.95    −0.09     0.81      Bou     49.2     64.6     82.8       28.2
Biography (Lively Biographies)
SI            −0.09     0.98     0.00     0.41      340     53.6     72.4     69.7        0.0
CT            −0.09     0.87    −0.04     0.43      220     48.2     1957     56.4        0.0
M                0)     0.91    −0.02     0.39      192     48.8     59.1    100.0        0.0
Quest Literature (Journeys)
SI             0.03     0.91     0.20     0.82      322     55.0     68.8     56.8        0.0
CT            −0.25      Od?    −0.25     1.05      144     49.2       55     45.1        0.0
M             −0.11     0.72     0.12     0.74       89      eS)     83.8    100.0        0.0
Mystery (It’s a Mystery)
SI            −0.16     0.75     0.08     0.41      232     52.1     62.9     90.5        0.0
CT            −0.59    elise     O05)     0.42      157     48.8     88.2     B25)        0.0
M                =O     0.68    −0.02     1.30      160     50.0     76.9    100.0        0.0

Mathematics
Equivalent Fractions
SI            −0.19     0.89    −0.06     0.46      663       DO     74.8     24.1         eS
CT          aa) eSi     0.94    −0.06     0.47      585     48.8     67.9     21.4       65.5
M             −0.40     0.81     (08)     0.43      451     50.3     74.5     36.8       47.5
TAU           −1.09     ODay     0510     0.34       36     47.9     70.2    100.0        0.0
Measurement
SI             0.16     0.81     0.09     0.92      548     50.4     78.0     19.0       59.7
CT              SOS     0.95     0.02     1.04      485     48.3     77.4     27.2       67.0
M               OMe     0.86     0.01     0.92      485     49.8     69.4     45.4       47.2
TAU           −0.85     0.93    −0.89     1.09       32     47.9     70.2    100.0        0.0
Geometry
SI            −0.24     0.89    −0.20     0.69      284     50.1     54.1     68.7        0.0
CT             0.65     0.68     0.65     0.55      128     50.1     80.2      4.7      100.0
M             −0.10     0.69    −0.07     0.68      103     47.7     67.2    100.0        0.0
TAU            S103     060s    −0.56     0.75       30     47.9     70.2    100.0        0.0

Science
The Nature of Light
SI             0.09     0.94     0.01     0.31      617     49.9     69.7     20.5       76.3
CT             0.17     0.86     0.08     0.34      444     49.1      sll      ao)       81.3
M             −0.36     0.84    −0.16     0.27      267     47.3     62.7       30       63.7
Magnetism
SI             0.05     0.83    −0.08     0.72      345     52.5     65.4      0.0       84.6
CT            −0.24     0.86     0.18     0.60      453     47.7     69.4      0.0      100.0
M              =088     0.98     0.03     0.57      119       47     79.7      0.0      100.0

Note. SI = successful intelligence; CT = critical thinking; M = memory; TAU = teaching as usual control. 
ᵃ The pretest and posttest scale is a function of Samejima’s graded response model (Samejima, 1997); 0 is defined as the average ability level for individuals 
as measured by the test. ᵇ School-level data (average of the percentage of students in the school for a given characteristic). 


advantage for the critical thinking condition over SI (b = 0.32, 
p = .04). 


Summary of Analyses 


In sum, the analyses, which included the intervention condition 
coded into multiple dummy-variables with SI as the reference 
group, revealed 7 effects (out of 23) of mention. There were four 


cases where SI was advantageous (Wonders of Nature, Journeys, 
Equivalent Fractions, and Light), one case where Memory was 
advantageous (Measurement), and two cases in favor of Critical 
Thinking (Mysteries and Magnetism). This is not substantially 
different from what we might expect by chance. The SI interven- 
tion did not lead to an overall advantage as expected, but equally 
it did not lead to a disadvantage. 
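
One way to make the chance comparison concrete is a sign test on the direction of the seven effects. This sketch assumes that the comparison refers to how the effects split between SI and the comparison conditions, which the text does not state explicitly; the function name `binomial_upper_tail` is hypothetical. It addresses only the direction of the effects, not how many of the 23 tests reached significance.

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Four of the seven effects favored SI; under a coin-flip null (p = .5),
# this is the probability of seeing four or more in that direction.
p_at_least_four_of_seven = binomial_upper_tail(4, 7)
```

Under these assumptions the probability is .50, so a 4-to-3 split gives no indication that effects favoring SI occurred more often than a coin flip would produce.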
[Figure 1: plot not reproduced. Panels: Language Arts, Mathematics, Science; x-axis: Curriculum Unit; y-axis: Regression Coefficient.] 

Figure 1. Regression coefficients and 95% confidence intervals of students in the SI condition relative to 
experimental conditions for each curriculum unit. The SI group is set at 0.00. Correspondingly, all units with 
coefficients below the 0 line indicate an advantage of the SI condition (and conversely for coefficients above the 
0 line). Conditions: SI = successful intelligence; CT = critical thinking; M = memory; TAU = teaching as 
usual. Language Arts units: wn = Wonders of Nature; tw = True Wonders; bio = Lively Biographies; jny = 
Journeys; mys = It’s a Mystery. Mathematics units: fra = Equivalent Fractions; mea = Measurement; geo = 
Geometry. Science units: lght = The Nature of Light; mag = Magnetism. 


The pattern of influence of the covariates, both alone and as 
interactions, is varied across interventions (see Table 6, covari- 
ates). This pattern attests to the diversity of variables that influ- 
ence, in complex ways, attempts to scale experimental investiga- 
tions of intervention effects into everyday contexts. Controlling for 
these demographic characteristics of the schools and classrooms 
using the data we have access to, the SI intervention was advan- 
tageous in each domain (Language Arts, Mathematics, and Sci- 
ence) but weakly and inconsistently so. 


Discussion 


Based on the data collected in previous studies and discussed 
in the introduction, teaching for successful intelligence has 
been shown to help improve instruction and assessment in a 
variety of disciplines at diverse grade levels (Grigorenko et al., 
2002; Sternberg et al., 1998, 2011). Most important, SI research 
has helped to provide a way of showing that if students are 
taught in a way that fits their ability profiles, they will achieve 
at higher levels and be better able to leverage their diverse skills 
(Sternberg et al., 1999). 

The range of results found in the current study across all units 
and conditions is discordant with our previous findings. That 
is, regardless of (a) the rigorous research design, (b) the sub- 
stantial resources invested by our team of highly skilled re- 
searchers drawn from around the world, as well as the numerous 
classroom teachers who invested time and energy to be in- 
volved, (c) the infrastructure available from one of the very best 
universities in the world in which the project was hosted, and of 
course (d) the recognition and support of the National Science 
Foundation (NSF) granting committee who invested in the SI 
theory to fund this large-scale research project, the results are 
sobering, especially in light of our previous successes. Because 
of the investments of the many stakeholders involved with the 


project, it is incumbent on us to reflect on the implications of 
these findings in relation to the future of SI theorizing and for 
educational research that aims to scale up interventions that 
have previously demonstrated advantages in small, controlled 
studies. In this regard we first consider the future utility of the 
“economy of scale” argument, on which large-scale interven- 
tion studies are often grounded, and second reflect on the 
specific implications of scaling the SI intervention relative to 
the strong control interventions in regard to implementation 
fidelity. 


Economies of Scale: Is It a Viable Approach? 


One potential explanation for the observed results is that the 
attempt to apply economic theories and models to education may 
be fundamentally misguided. Many policymakers endorse a fac- 
tory metaphor for thinking about education, in which students are 
the “products” to be filled with knowledge and teachers are a 
means of production (see Madaus, Haney, & Kreitzer, 1992, for a 
description). The microeconomics concept of economies of scale, 
upon which the notion of scaling up educational interventions 
rests, has been demonstrated to be highly effective in the manu- 
facturing world (e.g., Henry Ford’s assembly line). However, 
Seddon (2010) has argued convincingly that economies of scale 
are not applicable in the context of human service professions, and 
educational delivery is arguably much more closely aligned with 
human services than with a factory metaphor. Further, as Elias et 
al. (2003) noted, one of the reasons that scaling up educational 
interventions is challenging is because educational interventions 
primarily rely on human operators rather than technologies. Teach- 
ers are not automatons that execute a standardized curriculum in a 
standardized way. Rather, Elias et al. suggest that a more useful 
metaphor for thinking about scaling up interventions is a sailing 
analogy in which various elements of the environment can take a 
toll on a successful voyage and thus call to the forefront the skill 
of the sailors in navigating the environment. In addition, given the 
long history of local control of education in the United States, each 
State, district, and even school may have a unique cultural, orga- 
nizational, and educational context (Stemler & Bebell, 2012). 
Although there is currently a movement toward the development 
of Common Core Standards in education in an effort to reduce 
some of the variability in curricular issues, this will not address all 
of the systemic variability that can impact efforts to scale up 
educational interventions. 

Given the rigor strived for but not necessarily fully attained in the 
current investigation, our data suggest that it may be time to abandon 
the illusion that economies of scale should be pursued in the context 
of educational interventions. Instead, alternate models such as those 
being embraced by various teacher education programs throughout 
the country currently appear to us to be more promising. These 
models take a very different approach in which the implementation is 
tightly monitored and supported and in which new organizations 
wishing to join must be evaluated for the relevance of their contextual 
characteristics. 


Implementation Fidelity at Scale: SI Dynamics 


Traditional higher level teaching interventions, like training for 
memory skills, are formidable interventions against which to pit new 
teaching approaches for a number of reasons. First, traditional, 
memory-based strategies are the ones teachers may be expected to 
revert to in uncertain situations (e.g., when attempting to implement a 
new teaching philosophy for the first time). It takes time for teachers 
to acclimate themselves to a new philosophy, and two days of teacher 
training, although the most we could request, simply may not be 
sufficient. Second, given that the SI condition includes traditional 
memory and critical thinking aspects, as well as creative and practical 
ones, it may be possible for teachers to focus on more traditional 
aspects and still feel they are appropriately adhering to the SI condi- 
tion. Third, it is important to remember that the unit content was 
identical across all conditions. The differences between intervention 
conditions were in the framing of the teacher training, which included 
differential instruction in the underlying philosophy of SI, M, or CT, 
as appropriate. Furthermore, the curriculum content across all units 
and conditions was strong and well structured enough to provide 
engaging activities aimed at facilitating knowledge acquisition in the 
specific domain regardless of the intervention framing. Fourth, just as 
it is expected to take time for teachers to acclimate themselves to the 
SI philosophy, students also need time to adjust to differences in 
instruction (Jeltova et al., 2011). Finally, many of the content areas 
chosen for the units inherently required analytic skills and the mem- 
orization of facts. This is certainly true for the Mathematics units and 
to a lesser extent the Science units. However, it is also true of the 
Language Arts units. 

It also is possible either that the SI model does not work 
effectively for all the conditions we studied or that our realization 
of it was less than fully effective. It would take further research to 
elicit a more definitive answer to such questions. 


Limitations 


A study such as this one obviously has its limitations. We 
consider population issues, cost-benefit issues, and teacher and 
student issues that impact on fidelity of implementations. 


Population issues. All students were fourth graders, and only 
three academic subjects were used. The sheer scale of the study 
practically ensured that some implementation sites would have 
higher fidelity than others. In addition, given that the study un- 
folded in nine states across the country, it was impossible to utilize 
a single standardized achievement test across all study groups and 
all domains. A measure of overall achievement (i.e., an end-of- 
year standardized achievement test) would have provided an al- 
ternative test of effectiveness. 

Cost-benefit analyses. As innovation and change are 
costly, a fair question to consider is the cost-benefit analyses 
that compare the obtained gain in achievement to the costs of 
introducing a change in instruction. This question has not been 
the focus of investigation in studies introducing cognitive 
theories of learning-based approaches to classrooms, and the 
theory of successful intelligence is not to be excluded. 
Nevertheless, we are not prepared to conclude just yet that 
cognitive-based interventions, including those grounded in the 
theory of successful intelligence, generally do not lead to suf- 
ficient enhanced student achievement to be worth the effort. 
This is in part because the specific advantages of cognitively 
based interventions may interact with content, school-level 
variables and the scale of the implementation in complex and 
dynamic ways. 

Insights from the present efforts to upscale an instructional 
intervention within the context of an experimental study are 
consistent with those stated in the literature. First, teacher 
buy-in plays a critical role in the success of any curricular 
intervention. Throughout the year, teachers inevitably faced 
many external demands that compromised their ability to com- 
plete all of the intended units. Second, when working with 
intact classrooms, there are potential confounds that can creep 
into study design. In the case of the current study, there are 
examples in which the instructional condition is confounded 
with a particular type of classroom (e.g., gifted classrooms that 
received memory-based instruction). Such anomalies cannot be 
co-varied out. 

Teacher and student issues. We found perhaps the most 
challenging aspect of the study to be teachers’ differential 
comfort levels with various instructional methods. Even though 
teachers were trained in the teaching method they would use, 
when under stress, we might expect some teachers to revert to 
what is easiest and most familiar. Under the pressures of 
day-to-day teaching over the long term, which poses different 
demands than either teaching for a laboratory experiment or 
teaching for a short-term study, even teachers who are well 
trained in a new method may find themselves reverting to older, 
more familiar methods that they can use without the constant 
vigilance and concentration required of new interventions. They 
revert because they are under so many other pressures: class- 
room management issues, parental pressures, and administra- 
tive mandates that they need to confront at the same time. 
Fidelity to treatment method thus becomes an issue, and such 
violations of fidelity are particularly difficult to control in the 
context of a large-scale study such as ours. 

A further issue is students’ own comfort with different methods 
of instruction. Students, like teachers, are simply much more 
familiar with memory-based instruction than with other methods 
used in the teaching/learning process. Because the students’ mental 
resources often are split between listening to the teacher, thinking 
about and planning for events going on in their extracurricular 
lives, and engaging in the social context of the classroom, they as 
well may find it easier to relate to traditional teaching than to novel 
methods of instruction. 


Summary 


In sum, the results of this large-scale, multistate study suggest that 
there are difficulties associated with scaling up educational interven- 
tions that have been demonstrated to be effective in smaller contexts. 
Implementation of the curricular materials was designed and imple- 
mented with a minimal level of support from the research team, and 
the student achievement results revealed that the impact of the cur- 
riculum on student performance, when compared with strong peda- 
gogical approaches involving teaching for memory and/or critical 
thinking, as well as with “teaching-as-usual” approaches, was heter- 
ogeneous. The results suggest that SI instruction does lead to student 
achievement outcomes that are, at a minimum, generally equivalent to 
those associated with other strong instructional interventions. Overall, 
the effects were weak, and the pattern of influence of the school and 
classroom covariates on posttest performance differed across inter- 
ventions and units. Across the domains of literature, mathematics, and 
science, enhanced student performance was observed in only 7 out of 
23 comparisons. SI was advantageous in four cases. There was one 
case where M was advantageous and two cases in favor of CT. 

The traditional approach would be to conduct more rigorous, lab- 
like investigations into SI effectiveness; consequently, smaller repli- 
cations of this study in different contexts might be called for. Or, it 
might be suggested that we investigate our critical thinking and 
memory interventions more rigorously. However, it is important to 
recognize that such rigor, by definition, introduces into the investiga- 
tion constraints that are not feasible in real, intact classrooms— 
constraints we specifically set out to free in the current study. 

It is important to place the data, results, and related discourse 
presented here in the larger context of the relevant literatures 
and question whether we as a research group, and the discipline 
in general, are going about such investigations the wrong way. 
Should implementation of interventions be tightly monitored 
and supported and participation eligibility be evaluated for 
relevance of contextual characteristics? These questions need 
deep reflection. The following observations seem to be impor- 
tant. 

First, even if a particular instructional approach has gener- 
ated robust evidence pertaining to its efficacy and replication, 
this does not mean that scaling it up will be as effective as 
its more controlled, smaller scale evaluations. We argue that 
such a diffusion of the promise of an intervention is linked, 
primarily, to contextual factors, both systematic and random, 
influencing the context in which the intervention is scaled. This 
observation is relevant not only to the work presented here but 
to many other educational interventions. Second, it appears that 
scaled-up interventions may be characterized by effect sizes that 
are smaller than those observed in more controlled evaluations of the 
efficacy and robustness of an experimental pedagogy. Third, 
systematic efforts are needed (a) to characterize and parame- 
terize contextual factors that threaten the consistency of an 
intervention when scaling up and (b) to quantify the expected 
decrease in previously reported intervention effect sizes. These 
issues should be factored into the cost-benefit analyses of 
implementing change in education and should inform policy 
decision making. In such analyses and decisions, the empirical 
challenges to an innovation should be considered along with the 
humanistic and societal values and the ever-changing demands 
of the labor market. Factors such as these often do not wait for 
the relevant rigorous studies to be completed in a time compa- 
rable to the dynamics of real life. 

Appendix 


Technical Issues in the Handling of Data 


This Appendix describes the technical details regarding the 
handling of data. Participating teachers were instructed to label and 
package all student materials in a particular way and send the 
materials to Yale. A set of materials from one test (pre- or post-) 
from one teacher from one unit was called a “package.” For a 
package to be processed and entered into the database, the student 
workbook had to meet the fidelity standards (see above). In collaboration 
with the Yale University Social Science Statistical 
Laboratory, an Access database template was developed. This 
template was used to build separate databases for each of the 4 
years of data collection. Each database was used to (a) inventory, 
or log, the materials received from the teachers; (b) track the 
materials as they were sent to coders to score; and (c) store test 
data and demographic information. The four databases were 
housed on the central PACE server, with file access restricted to 
members of the project team. 


Database Structure 


The structure of the database was rather complex and contained 
several types of tables, as described below. 

Participant information tables. Four tables contained non- 
test information about the different participants in the study: stu- 
dents, teachers, schools, and districts. Unique ID numbers were 
given to each element within a table (e.g., each packet was given 
a unique ID in the packet table). Each teacher was given a different 
teacher ID number for each school at which he or she taught during 
that year; hence, some teachers were given more than one unique 
ID. In most cases, the information in these tables was entered into 
the database before any assessment data were collected. 

Coding administration tables. Tracking test materials was a 
particularly challenging part of administering a large-scale project 
that involved continuous receipt and scoring of tests. Two impor- 
tant database processes were involved. The first process, material 
logging, was used to inventory the completed assessments received 
by the PACE Center. The second process, material checkout, was 
used to track the assessments as they were given to coders to score. 
Four Access tables were involved, and information was continu- 
ally added to these tables as part of the material logging and 
checkout processes. First, upon receiving student materials from 
the schools and/or teachers, PACE research assistants “logged in” 
the materials to the Access database for the appropriate year of 
data collection. The logging process consisted of assigning tests of 
similar type (e.g., Geometry pretests) from a single teacher to a 
“packet” and creating inventories of materials received by the 
PACE Center. 

A packet was considered both (a) an envelope containing a 
collection of one particular test type for all the students associated 
with one teacher and (b) an Access database unit that identified 
this collection of student tests. A system of Access forms was 
developed to allow a research assistant simultaneously to add 
information to two tables that inventoried and tracked the packets 
and tests. The first table was a “material” table used as an inven- 
tory of test materials received by the center. The second table was 
a “packet” table that was used to assign a packet number to a 
packet and track it as it was sent to coders to score. A packet that 
was successfully logged in was then ready to be rated by a coder. 
Second, via a system of queries, packets were selected and as- 
signed to coders to rate. A “checkout” table tracked when each 
coder checked out and returned each packet. In addition, a “coder” 
table was maintained that contained a list of each coder, his or her 
unique coder ID, and notes about the coder. Queries and forms 
were used simultaneously to update these tables as coders com- 
pleted the agreement process and as packets were assigned to 
coders. 
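
The logging and checkout processes described above can be sketched in a few lines. The following is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than Access, and every table name, column name, and function name is hypothetical rather than taken from the project's actual database.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the project's Access design.
SCHEMA = """
CREATE TABLE coder (
    coder_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    notes    TEXT
);
CREATE TABLE packet (
    packet_id  INTEGER PRIMARY KEY,
    teacher_id INTEGER NOT NULL,
    test_type  TEXT NOT NULL          -- e.g., 'Geometry pretest'
);
CREATE TABLE material (
    material_id INTEGER PRIMARY KEY,
    packet_id   INTEGER NOT NULL REFERENCES packet(packet_id),
    student_id  INTEGER NOT NULL,
    received_on TEXT NOT NULL
);
CREATE TABLE checkout (
    checkout_id INTEGER PRIMARY KEY,
    packet_id   INTEGER NOT NULL REFERENCES packet(packet_id),
    coder_id    INTEGER NOT NULL REFERENCES coder(coder_id),
    checked_out TEXT NOT NULL,
    returned    TEXT                  -- NULL while the packet is still out
);
"""


def open_tracking_db(path=":memory:"):
    """Create the four tracking tables and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def log_packet(conn, teacher_id, test_type, student_ids, received_on):
    """Material logging: create one packet and inventory each test in it."""
    cur = conn.execute(
        "INSERT INTO packet (teacher_id, test_type) VALUES (?, ?)",
        (teacher_id, test_type),
    )
    packet_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO material (packet_id, student_id, received_on) "
        "VALUES (?, ?, ?)",
        [(packet_id, student_id, received_on) for student_id in student_ids],
    )
    return packet_id


def check_out(conn, packet_id, coder_id, when):
    """Material checkout: record that a coder has taken a packet to score."""
    cur = conn.execute(
        "INSERT INTO checkout (packet_id, coder_id, checked_out) "
        "VALUES (?, ?, ?)",
        (packet_id, coder_id, when),
    )
    return cur.lastrowid


def check_in(conn, checkout_id, when):
    """Record the date on which a checked-out packet was returned."""
    conn.execute(
        "UPDATE checkout SET returned = ? WHERE checkout_id = ?",
        (when, checkout_id),
    )


def packets_out(conn):
    """Return the IDs of packets that are checked out and not yet returned."""
    rows = conn.execute(
        "SELECT packet_id FROM checkout WHERE returned IS NULL "
        "ORDER BY packet_id"
    ).fetchall()
    return [row[0] for row in rows]
```

Keeping the inventory (material) separate from the tracking unit (packet) mirrors the two-table arrangement described in the text: one row per test received, and one row per envelope that moves between the center and a coder.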

Data tables. When a coder was assigned a packet to score, 
she or he was also given a scoring template for the packet: an 
Excel spreadsheet used to record multiple-choice item re- 
sponses and open-ended item ratings for each student. These 
templates contained a row for each student, with columns 
corresponding to the ratings needed for each item and additional 
columns containing identifying information (IDs for the packet, 
student, type of test, and coder). Each scoring template con- 
tained columns for only one test type (e.g., Geometry pretests) 
that corresponded to the columns of an Access data table: data 
for each test type were stored in separate Access data tables. 
The coder returned an electronic (Excel) and/or a paper copy of 
the completed scoring template to the National Science Foun- 
dation (NSF) team. If the coder submitted only a paper copy, it 
was given to data entry personnel to enter into the Excel 
template (with this latter option reserved for skilled coders who 
had little computer access or expertise). Information from the 
completed Excel scoring template was then directly uploaded 
into an Access data table by copying the data cells of the Excel 
sheet and pasting them into the Access table. 

Quality control. Measures were taken to monitor the quality 
of the ratings during data collection. These measures included 
limiting database access to a small number of the most experienced 
personnel, using data-validation controls to prevent the entry of 
out-of-range values, supervising the coders carefully after their 
training was completed, and maintaining problem logs in the 
database. 

Limiting the number of Access users. Access to the database 
was limited to only a small core of management personnel to 
ensure participant confidentiality and to minimize the possibility 
of human error. The databases were stored on a central server and 
required network permissions to be viewed or modified. For most 
of the study, only a small number of our most technologically 
sophisticated personnel were allowed access to the database to 
check, upload, and clean data. At times, the number of people 
working simultaneously on coding exceeded 20 trained coders. 
Rather than having all of these coders enter their ratings into the 
Access data tables directly, we introduced a middle step between 
rating and Access data entry. Coders’ ratings were entered into 
Excel, as described above, and then given to the core database 
managers to upload. 

Excel template and Access table validations. Two related 
measures were taken to prevent the entry of out-of-range values 
into the Access databases. First, cell validations were used in Excel 
that would allow coders to enter only legitimate ratings. Legitimate 
ratings included codes used to designate an omitted or illegible 
response to an item (i.e., 6 or e for omitted responses, and 7 or f 
for illegible responses). A second layer of protection was also used 
to prevent the uploading of empty (unrated or unrecorded) data 
cells into the Access database and to serve as a second check for 
out-of-range values. Validation rules were eventually implemented 
in all Access data tables to prevent the uploading of missing or 
out-of-range values. When a core NSF research assistant could not 
upload the data from a coder’s template because of an out-of-range 
or missing value, the paper copy of the template and/or the coder 
was consulted to find the true value of that rating. 
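
The two validation layers can be illustrated with a small check that runs before upload. This is a minimal sketch, not the project's actual Excel or Access rules. The codes for omitted (6 or e) and illegible (7 or f) responses come from the text; the 0 to 5 rating scale and all function names are assumptions made for the example, because the rating scale is not stated here.

```python
# Assumption: substantive ratings run from 0 to 5. The scale is hypothetical.
VALID_RATINGS = {0, 1, 2, 3, 4, 5}
OMITTED_CODES = {"6", "e"}      # codes for an omitted response (from the text)
ILLEGIBLE_CODES = {"7", "f"}    # codes for an illegible response (from the text)


def classify_cell(value):
    """Return 'rating', 'omitted', 'illegible', 'missing', or 'out_of_range'."""
    if value is None or str(value).strip() == "":
        return "missing"
    text = str(value).strip().lower()
    if text in OMITTED_CODES:
        return "omitted"
    if text in ILLEGIBLE_CODES:
        return "illegible"
    try:
        number = int(text)
    except ValueError:
        return "out_of_range"
    return "rating" if number in VALID_RATINGS else "out_of_range"


def find_upload_problems(rows):
    """Find the cells that should block an upload.

    rows is a list of dicts, one per student, mapping an item column name
    to the cell value. The result lists (row_index, column, status) for
    every cell that is missing or out of range.
    """
    problems = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            status = classify_cell(value)
            if status in ("missing", "out_of_range"):
                problems.append((index, column, status))
    return problems
```

An empty result means the template can be uploaded; otherwise each flagged cell identifies the rating to look up on the paper copy or to ask the coder about.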

Coder supervision. Coders were not permitted to score tests 
until they reached an acceptable level of initial interrater reliability 
with their coding partner (i.e., the correlations between the pair’s 
open-ended item ratings were greater than .70). Coders who 
reached this criterion then began coding tests independently from 
their partners; coders who were not able to establish acceptable 
levels of interrater reliability were not permitted to continue on the 
project. Of the 90 coders who began the training and agreement 
process, only 76 were permitted to score tests for the study. Core 
personnel maintained weekly contact with active coders after ini- 
tial interrater reliability was reached. They maintained the quality 
of the ratings by being available to answer coders’ questions about 
scoring and by reminding coders of the scoring guidelines when 
the coder’s ratings were discovered to have violated validation 
rules. The design of the study also allowed for the discovery of 
coder irregularities throughout the scoring process. As a quality 
control check, over 30% of the tests each coder scored 
were also scored by another coder. These overlapping ratings were 
used to detect discrepancies between coders and to flag coders who 
were having particular difficulty. Data from two of the 76 coders 
were deleted due to continued discrepancies with other coders. In 
addition, during the final stage of data cleaning before analysis, a 
random 10% of codings were spot-checked against the hard copies 
of the assessments. The 74 remaining coders were diverse with 
respect to their genders, ages, educational backgrounds, and test 
coding experience. They ranged in age from 18 to 66 and included 
research assistants, undergraduate student workers, temporary 
part-time employees, and PhD-level research scientists. Many cod- 
ers had previous experience scoring tests, and some had experience 
creating scoring rubrics for tests. More than 20,000 pre- and 
posttests including over 400,000 items were read and rated by 
these raters. All pre- and posttest packets had two raters, who 
used written rubrics to evaluate the quality of children’s responses. 
Each pair trained together on one packet: The two raters in the pair 
rated identical tests, and their scores were then compared to establish 
interrater reliability. Differences in scoring were pointed out, and 
the rater pair discussed responses until agreement was reached. 
The raters were then sent back with the packet, and their new ratings 
were checked for reliability. Training was conducted until pairs reached 
an agreement of .70, which was treated as the minimum acceptable 
level; when the agreement was reached, the raters read and rated a 
common overlapping set of materials (representing 1/3 of all the 
tests rated by the pair) and then each rater read and rated separate 
sets. The quality of the data and interrater agreement was moni- 
tored in an ongoing fashion. Rater biases were carefully evaluated. 
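
The agreement criterion for a rater pair can be expressed as a short computation. The sketch below assumes that agreement was indexed by the Pearson correlation between the two raters' scores on the same items, as the reference to correlations above suggests. The function names are hypothetical, and treating a value of exactly .70 as passing is an assumption, because the text describes the criterion both as "greater than" and as a "minimum acceptable level."

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 ratings")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one rater is constant")
    return sxy / sqrt(sxx * syy)


def pair_is_reliable(ratings_a, ratings_b, threshold=0.70):
    """True if the pair's ratings correlate at or above the threshold."""
    return pearson_r(ratings_a, ratings_b) >= threshold
```

The same function could be applied to the overlapping third of tests that both raters in a pair scored, which would allow drift to be flagged after training as well as during it.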

Problem note tables. During the last year of data collection, 
tables were created in Access to centralize notes made on test 
administration quirks (e.g., a teacher photocopying all but a page 
of a test). One table was used to record problems with a particular 
student; another table was used to record quirks that affected the 
entire classroom. Both tables made note of what was done to 
correct the problem. These notes were used to ensure that common 
problems were treated consistently. These tables were used to 
determine the usability of the data for final analyses. 